Humans Understanding How Machines Understand Language
Copyright © 2022 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Language Models in Plain English, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and IBM. See our statement of editorial independence.
978-1-098-10904-2
[LSI]
What is it like to be a bat?
The philosopher Thomas Nagel asks this question in his 1974 essay on consciousness.1 His position is that the answer is unknowable. If I imagine that I have webbed arms and poor vision, perceive the world by sonar, subsist on a diet of insects, and spend the day hanging upside down, “it tells me only what it would be like for me to behave as a bat behaves.” But if I try to imagine what it’s like for a bat to be a bat, my restrictions to the limited range of my own mind and experiences render this impossible.
Humans and bats, at least at the time of this writing, have no shared language. On the other hand, countless AI models exist in the world—many of which have been created specifically to communicate something to us in our own language.
The recent explosion in advances in machine learning has brought a myriad of interesting, powerful, and increasingly opaque models. Simultaneously, the recent movement toward democratization of AI has lowered the barriers for being a data scientist and using machine learning models. It is simple to deploy a model in the real world without being concerned about explaining the output or without exploring the ethical implications of decisions that will be made on the basis of that output. The ease of creating and using machine learning models is going up; the ease of understanding what machine learning models are doing is going down (Figure 1-1).
We use a myriad of technological advances every day without wanting or needing to understand the details of how they work. You may not know exactly how your car, your toaster, or even your house was put together, but you have at least a general intuition of motion, electricity, and architecture. You (mostly) do not anthropomorphize them to the extent of accusing your toaster of being biased toward burning your bread, or suspecting your car’s air conditioner of deliberately refusing to work on especially hot days. When you observe such effects, you know to change your toaster settings or to take your car in for a checkup.
The human race is in a much earlier stage of its relationship with AI than it is with cars and appliances. We use AI every day without wanting or needing to understand the details of how it works—search, communication, vision, automation, the list goes on. However, we have not yet settled into a comfortable intuition of the general functionality. Unlike with toasters, we struggle to keep from anthropomorphizing AI—and language models (LMs) are one of the more challenging examples of this, as language output is made to be interpreted.
This report is not going to teach, or even get into, how to run and deploy language models. Many wonderful practical resources are available for this, at various levels of granularity.2 You may have never personally run a language model (or a system with a language model component). Alternatively, you may be creating, training, and running language models on a regular basis, or consuming them in downstream applications. Wherever you fall in that range, we suppose the following:
Let’s begin by turning a critical eye to what a language model is, what it does, and how it approaches language.
Language models are at the root of many of our experiences with technology. Among many, many examples, you are interacting with a language model when:
A language model is a technique for calculating the probability of a particular sequence of words occurring. It tries to emulate certain human linguistic capabilities, learning a myriad of associations between words to represent, depending on the task at hand, your own language patterns, the patterns of a set of other people, the patterns of English in general, and so on.
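For readers who want to see that definition in symbols, it corresponds to the standard chain-rule factorization of a sequence probability (a general identity from probability theory, not a property of any particular model):

P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})

Each factor on the right-hand side is one round of “predict the next word”: how likely is w_i, given everything that came before it?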
But remember, a human will not know what it is like to be a bat through imitating bat behavior. Likewise, an LM will not know what it is like to be a human through imitating human interaction with language. An LM learns only what it would be like if it behaved like a human, within the model’s capabilities. And since this report is written for a human audience (not for LMs), we will spend the rest of the report providing some perspective on how humans should think about what an LM is doing when it is trying to behave like a human.
So what do LMs do? Consider a simple game between two people, popular in improv comedy. One person begins by throwing out a single word, the second person gives the next word to follow, the first person gives the third word, and so on: “I,” “went,” “to,” “the,” “mall,” “and,” “found,” “a,” “stupendous,” “llama,” “that,” “was,” “purple.” The object is often to see who is first to laugh or to find themselves unable to continue the game. The direction the generated story takes depends on the background knowledge of the players, their previous experience with the game, their goals (to keep going for as long as possible? To cause the other person to laugh?), and many other subtle and variable aspects.
In the simplest terms, each time a trained language model is invoked, it’s playing the next round of this game. The prompt may be something like a part of a sentence to be continued (“I went to the”), or something to be translated (speech or text in a different language). The object is to output the best possible word(s) for the given prompt, and what the model chooses as the best output depends on how it gets trained.
The model is trained by playing the game of “predict the next word,” where all the best guesses are already known (the sentences or translations already exist). In each round, the model competes with itself, trying to guess the next word correctly, and tweaking its point of view to get closer to the existing text.
A language model (LM) is a technique for calculating the probability of a particular sequence of words occurring. In plain terms, the model is always playing the game of “predict the next word.”
To understand how an LM approaches language, let’s first consider how people think about language.
A fluent English speaker expects to come across “red book” rather than “book red,” and “hot dog” rather than “cold mouse,” and “early bird” rather than “late bird.” When asked why this is, a human may possibly be able to articulate the reasons for these examples: in English, adjectives usually precede nouns, and a “hot dog” is a phrase describing a popular food and not just a synonym for “overheated canine,” and an “early bird” is a description of early risers or early arrivers, as well as part of a well-known proverb. However, a human’s simplest answer would be that you “just see” the common phrases, and “just don’t see” the uncommon ones. This hodgepodge of human responses draws on the two components of language: structure and meaning.
However, LMs are aware of only structure. They are entirely focused on outputting something grammatically correct—without the notion that such a thing as grammar exists. A good language model (of English) will also correctly rank “red book” as more likely than “book red,” and so on, because it will have “read” enough English text to learn what is common and expected.3
But the model can never know what “hot dog” actually means. For that matter, the model does not know what “hot” and “dog” mean separately, nor is it able to grasp the concept of words as having meaning. In the same way, an LM cannot intentionally tell lies, obfuscate facts, or spare your feelings. When we encounter language, we experience the illusion of meaning because we must. LMs do not because they cannot.
Language models don’t make judgments; they make predictions. Language models do not mean what they say. Language models generate well-formed language, and humans experience it as an illusion of meaning.
Language models (and AI models more generally) are constructed through the framing of “becoming as good as possible” at a specific task. An LM “understands” a task only in terms of the content encountered during training for it: for input that is this, the output is that. To further generalize the statement: for input that is like this, desired output is like that.
The representation of all the necessary and sufficient information to calculate like is usually referred to as the feature space. Perfecting the calculation for weighing this information in preparation for unseen input is the process of learning, or training. This discussion is a simplification, but it applies to everything else discussed in this report as well as to artificial intelligence more broadly.
The process of creating a trained model is the journey from “for input that is this, the output is that” to “for input that is like this, desired output is like that” by representing all the necessary and sufficient information to calculate like.
As we described, the most straightforward task for an LM is learning to predict the next word: for an input sequence of K words, the output is the word W that the language model deems the most likely to come next.4 The model is exposed to training data comprising many input and output examples.5 (Examples just from the preceding sentence include the input “model” with the output “is,” the input “exposed to” with the output “training,” and so on.) For more advanced LMs, the input is more complicated than the few words preceding the output, incorporating additional contextual information from the rest of the sentence or from even further away.
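To make the shape of this training data concrete, here is a minimal Python sketch (not drawn from any particular toolkit) that slices running text into (input, output) pairs using a fixed window of K words. The whitespace tokenizer and the tiny window are simplifying assumptions for illustration only.

```python
def make_training_pairs(text, k=3):
    """Slide a window of k tokens over the text; each window is an input,
    and the token immediately after it is the target output."""
    tokens = text.lower().split()  # naive whitespace tokenization, purely for illustration
    return [(tokens[i:i + k], tokens[i + k]) for i in range(len(tokens) - k)]

for inp, out in make_training_pairs("The model is exposed to training data", k=2):
    print(inp, "->", out)
# ['the', 'model'] -> is
# ['model', 'is'] -> exposed
# ... and so on
```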
During this learning process, the model constructs its representation to “understand” the language of the training data, and adjusts that representation until it comes as close as it can to matching the examples in the training data. Think of your own experience training for a test by using flash cards, with a prompt on one side and the answer on the other. You check your mastery of the material by how closely your answer to a prompt matches the other side of the card. The language model is learning from many (many!) such flash cards. But what exactly is this representation that is being learned?
Language models use three general approaches to encode the information that represents language:
The model manually encodes the grammar structures, which is difficult and time-consuming to do, but is also explicit and fully explainable. The model represents language as a set of rules. One such rule for English would be that an adjective usually comes before the noun it is modifying (“red book,” not “book red”), though there are exceptions, such as when used with some verbs (“the cake tastes great,” not “the great cake tastes” or “the cake great tastes”).
The model, using a reference text, counts word occurrences, and relies on those counts when playing “predict the next word.” As an overly simple example, a bigram (two-word) language model counts five instances of the word “red” in the reference text, of which two are “red book” and three are “red shoes.” When asked to predict the next word after a prompt of “red,” the model will answer “book” with two-fifths probability and “shoes” with three-fifths probability. An LM can learn these probabilities for a word sequence of any size (one word, two words, three words, etc.). Additional statistical tricks can address issues such as dealing with previously unseen sequences. The model represents language as a set of word sequences and their associated probabilities. Compared to modern neural network language models, statistical language models are both far better understood and, at the current time, waning in popularity.
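As a sketch of this count-based idea, the following Python snippet estimates those conditional probabilities from a toy corpus constructed so that “red” appears five times, followed twice by “book” and three times by “shoes.” Everything here, including the corpus, is illustrative rather than a real training setup.

```python
from collections import Counter, defaultdict

corpus = "red book red shoes red shoes blue book red shoes red book".split()

# Count how often each word follows each preceding word (bigram counts).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_probs(prev):
    counts = following[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probs("red"))
# {'book': 0.4, 'shoes': 0.6}  -> "book" with two-fifths probability, "shoes" with three-fifths
```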
The model represents every word in the language as a vector in a large dimensional space. This is the representation primarily used by deep neural network language models.
The third approach, embeddings, is both the most difficult one to understand and the one used by most state-of-the-art language models, so we will now spend a little extra time with it.
Consider the following (well-known) analogy: man:woman::king:?6 Or, “man” is to “woman” as “king” is to what word? With reasonable fluency in English, you will have little difficulty coming up with the answer of “queen.” You might explain your reasoning in a couple of ways:
In your mind, you have a concept of the difference between “man” and “woman” as being like the difference between “king” and “queen”; similarly, “man” and “king” are different, like “woman” and “queen” are different. We have words to represent these differences: “gender” and “royalty.” Furthermore, adding any other words does not change the relationships: a “tall man” is like a “tall king” in the exact same way that a “tall woman” is like a “tall queen”—that is to say, the difference between them is still only the concept of “royalty.”
Taking a critical eye to the previous sentences, it is clear that we as humans define all these words in a circular way, relative to each other (all words are, of course, defined by other words). We hold some representation of these words in our mind, and when probed, use other words to describe the meaning of that representation—the concepts of royalty, gender, and so on.
LMs, as we have discussed, do not grasp the idea of meaning. But in the space of word embeddings, they have a crisp and quantifiable idea of differences and of objects being like other objects:
Determining to what extent a relationship between one pair of words (“man” → “woman”) is like the one between a second pair of words (“king” → “queen”) is equivalent to calculating how similar the differences (size and direction of movement in the embedding space) are between the words of the first pair and the words of the second.
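Here is a minimal numeric sketch of that calculation, using hand-picked two-dimensional vectors purely for illustration. Real embeddings have hundreds of dimensions and are learned from data; the numbers below are assumptions, not values from any trained model.

```python
import numpy as np

# Hypothetical 2-D embeddings; the axes have no inherent meaning.
emb = {
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([1.0, 3.0]),
    "king":  np.array([4.0, 1.0]),
    "queen": np.array([4.0, 3.0]),
}

# "man" -> "woman" and "king" -> "queen" should be (roughly) the same movement.
target = emb["king"] + (emb["woman"] - emb["man"])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

best = max((w for w in emb if w != "king"), key=lambda w: cosine(emb[w], target))
print(best)  # queen
```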
Figure 1-2 illustrates how an LM would see the four words of our analogy as points in an embedding space.
We can now loosely translate our human rationales into two potential paths to take to arrive at the answer to the analogy:
The dimensions in the n-dimensional space of word embeddings do not have any inherent meaning. Similarly, the model cannot put any meaningful interpretation on the distance between word coordinates.
A few final words on embeddings. It should come as no surprise that the real world is messier than our simplified example with four words. The word embeddings that an LM learns are based on the text it is trained on, and the resulting encoded relationships will include, sometimes in strange and hard-to-track ways, all the biases and idiosyncrasies of that text (recall that LMs make predictions, not judgments).7 We further discuss these extremely important considerations toward the end of the report.
In addition, many words have multiple meanings, often encompassing multiple parts of speech, such as both verb and noun,8 rendering their vector representations potentially ambiguous and certainly not as clean as in the preceding example. And the dimensions of that embedding space do not have any inherent meaning (remember, LMs work only with structure, not meaning). Word descriptions of this purely mathematical space have the same effect as word descriptions of being a bat—we can discuss hanging upside down, but we’re really talking about a whole different animal.
As we’ve mentioned, LMs use three general approaches: (1) manual, based on linguistics; (2) statistical, based on probabilities; and (3) neural, based on embeddings. This report focuses on the last of these. The current state-of-the-art LMs are either entirely composed of neural networks or have a significant neural architecture component. At the same time, neural network–based LMs are far more difficult to get a comfortable sense of, compared to manual or statistical LMs. Our goal for this report is for you to gain this intuition without getting tangled in the mathematics of the inner workings of these models.
In the following chapters, we explain how neural LMs are used in completing a few of the most common and important text tasks, including text summarization, translation, and reading comprehension. We discuss the practical differences between how humans and LMs approach these tasks. LMs are seeking to emulate the behavior of a human who is generating language, in the best way they can, given how they are constructed. Therefore, throughout this report, we compare how humans and LMs approach these tasks, so that we, as humans, can better understand how machines understand language. The rest of the report is organized as follows:
1 Thomas Nagel, “What Is It Like to Be a Bat?” The Philosophical Review 83, no. 4 (October 1974): 435-450.
2 Here’s one good guide to NLP systems in a business setting: Practical Natural Language Processing by Sowmya Vajjala et al. (O’Reilly).
3 “Read” is in quotes to emphasize that language models don’t read in the same way humans do.
4 LM inputs and outputs, of course, can include punctuation.
5 For a large language model, many training examples are necessary.
6 For the academic origins of the analogy, see “Efficient Estimation of Word Representations in Vector Space” by Tomas Mikolov et al. If you’re interested in further academic digging on word analogies, see “Word Embeddings, Analogies, and Machine Learning: Beyond King - Man + Woman = Queen” by Aleksandr Drozd et al.
7 This is especially true when learning on many examples.
8 For just a few examples, words that can be either a verb or a noun include “run,” “walk,” “bat,” and “throw.”
If you have read about neural networks, you are probably aware that they are so called because their general structure is inspired by the observed behavior of biological neurons. However, this is not the same as saying that a neural network constructed to perform machine learning tasks works in the same way as a human brain.
An artificial neural network is intended to mimic certain things the human brain does. But in the same way that eating bugs and using sonar cannot let a human know what it is like to be a bat, learning to predict the next word does not allow an LM to know what it is like to be a human. This goes both ways: the fact that our behavior inspires neural network architecture, and there is convergence in the observed outcome (a word is, indeed, predicted), does not mean that we know what it is like to be an LM.
We therefore introduce neural network language models, and all subsequent techniques in this report, by first making critical observations of how humans approach language, memory, and communication. It is important to develop good intuition about the basic building blocks of a neural network language model so that we may take on the increasingly complex models that are coming to the forefront of language modeling.
To communicate these underlying concepts, we will begin with the very human process of learning to bake through trial and error. In our upcoming example, we make certain assumptions to better understand the act of learning, rather than the underlying process that defines learning. We highlight some concrete lessons to motivate the technical introduction to neural networks (with more bats!), with a focus on getting a handle specifically on Recurrent Neural Networks and their variations.
An essential aspect of what it is to be human is to eat. Eating is an act of necessity at worst and an act of pleasure at best. Humans, as we would have it, have passed down information for thousands of years through the use of symbols and language. This information persisted orally for some time and ultimately was written down. We now have written records communicating information from hundreds to thousands of years ago about all kinds of things, but for brevity, let’s focus on cooking.1
In this scenario, we imagine ourselves in our most basic form: we have limited capacity to hold information in memory, we lack experience, and we are unable to embed outside information into our knowledge base. In this limited form, we must find a way of gaining experience and carrying it forward to accomplish our goal, and to increase our ability to accomplish this goal well. To explore the methods of learning in this limited state, we’ll follow a recipe. To reiterate, in our current, abstracted form, we have limited memory and limited ability to embed additional information (we know only what we can read), and we need to gain experience. Here is our recipe and ingredient list:
Instructions:
0. Grab ingredients.2
1. Preheat the oven to 350°F (175°C). Grease and flour a 9×9-inch pan or line a muffin pan with paper liners.
2. In a medium bowl, cream together the sugar and butter.
3. Beat in the eggs, one at a time; then stir in the vanilla.
4. Combine flour and baking powder, add it to the creamed mixture, and mix well.
5. Finally, stir in the milk until the batter is smooth. Pour or spoon the batter into the prepared pan.
6. Bake for 30 to 40 minutes in the preheated oven. For cupcakes, bake for 20 to 25 minutes. Cake is done when it springs back to the touch.
Although we don’t know it yet, we will end up making multiple attempts, so we’ll refer to each one numerically, starting with Cake 0.
For Cake 0, you begin reading the instructions. You see a total of seven steps, in sequence. You start at step 0, as it appears first. You grab all the ingredients and read the directions and think, “Who needs these? It all ends up in the bowl anyway.” So besides grabbing the ingredients (step 0), you think that you can just do what you want in the order that suits you best.
You think that combining the flour and baking powder (step 4) sounds easiest. You dump them together and go to grab the creamed mixture. To your surprise, you cannot seem to find it. So you read the step before step 4: “3. Beat in the eggs, one at a time; then stir in the vanilla.” You do this, and add it to your flour and baking powder. You think you’re about finished when you notice you have unused ingredients—namely, the sugar, butter, and milk. You decide to just add them all in at once and stir.
You have a final mixture that looks fine and go to put it in the oven. It is not hot, so you read the instructions again, seeing that step 1 tells you to preheat the oven. Furious, you wonder why the instructions didn’t say “Instructions are in order” as step 0! You wait a couple of hours for your cake to finally be done, and it sucks. You tell yourself, fine, next time I will follow the instructions in order.
For Cake 1, you decide that you will follow the instructions in explicit order. You do step 0, step 1, step 2, and step 3. You get to step 4: “Combine flour and baking powder, add it to the creamed mixture, and mix well.” You look around for the creamed mixture and still cannot find it. Livid that the instructions lied again, you add the flour and baking powder to the beaten mixture. It’s all the same anyway.
You complete step 4 based on your success from the previous step (without the darned creamed mixture), but at step 5 (“Finally, stir in the milk until the batter is smooth. Pour or spoon the batter into the prepared pan.”), you can’t find the “prepared pan,” and what the heck is the “batter”? You grab your “9×9–inch pan” and pour in everything you’ve been mixing in the bowl. You place it into your “preheated” oven, and 30 minutes later you get a delicious cake!
By following the steps in order, you successfully made the cake! You also conclude that the “beaten mixture” must be the “creamed mixture,” the “stuff in the bowl” is the “batter,” and the “9×9–inch pan” is the same as the “prepared pan.” You feel like there is nothing left to learn now that you can successfully make the cake, so you try to commit everything you need to know to memory. However, some mistakes keep popping up in your head every now and then, reminding you that you didn’t know what things were called, and you wish you knew what was actually important to remember and what wasn’t (who cares that by step 5 it’s called “batter”?). So let’s learn how to forget.
Now, for Cake 2. You decide to make the cake again a week later but, unfortunately, you lost everything but the ingredient list. You try to recall the entire recipe but can remember only three things:
There is an order to the recipe that begins with preheating the oven to 350°F and getting a greased pan.
The wet ingredients should be mixed before the dry ingredients, finally ending up with the “batter.”
The batter goes into the oven until it turns into a cake.
You grab your ingredients and proceed to add the sugar and butter together, then add the eggs and vanilla, then the milk, and finally the flour and baking powder, producing the “batter” (you ideally should have added the milk last, but this does not cause undue problems). You add the batter to the pan you greased at the beginning, and bake it for 30 minutes. Upon opening the oven, you are greeted by the delicious smell of your cake.
No longer are you limited by the instructions! All you had to remember was the order of a few key aspects, the combination of ingredients, and that by mixing them together, you get batter that makes a cake! You realize that you didn’t need to remember the entire recipe; you could use an underlying pattern in combination with the ingredient list to make a cake! With your newfound baking skills, you can have your cake and eat it too.
As you sit and eat your cake, you reminisce on your baking experience and ponder what it is that you have learned. You see that you can learn information in terms of order over time, because sequences have patterns. You have also learned that relationships between things can remain set over time and that you can retain some pieces of information and forget other pieces with little consequence if you understand the underlying pattern.3
It is now that you think, what if you could teach a machine to make a cake? Wouldn’t it need to know only the abstract pattern of baking that you followed? You guess the real question is, how would you even begin to teach a machine how to learn that behavior?
Let’s use this sentence from the US Department of the Interior’s “13 Awesome Facts About Bats” detailing various sizes of bats as our running example:
Bats range in size from the Kitti’s hog-nosed bat (also called the bumblebee bat) that weighs less than a penny—making it the world’s smallest mammal—to the flying foxes, which can have a wingspan of up to 6 feet.
Imagine that you have done a lot of reading on this interesting mammal and are going to play the game of predicting the next word. Given the prompt of “Bats range in,” you will think of several reasonable words to follow—“length,” “weight,” “size,” “color,” and so on—and you will choose one of them. We will spend the rest of this chapter discussing how a neural network would mimic this outcome and choose a word to follow a given sequence.
As a reference, recall the mention of probabilistic language models, briefly described in “How Does an LM Represent Language?”. A trigram model would create a representation for the nonzero probabilities of every word that might follow the phrase “Bats range in” (as well as other trigrams) and output the most likely one, as shown in Figure 2-1.
So, how does a neural network predict the next word?
We are not going to walk back all the way to introducing artificial neural networks, discussing how their general structure was inspired by the observed behavior of biological neurons, and so on. Let’s jump straight to discussing the basic structure of a simple neural network, as shown in Figure 2-2.
To be clear, a neural network is not equivalent to an LM. We are discussing language models that are constructed as neural networks—but neural networks can work on many other tasks, such as in the area of image processing. An artificial neural network includes the following:
Input layer
Consumes an appropriately structured set of inputs.
Hidden layer(s)
Applies a set of transformations to the inputs.
Output layer
Sometimes also referred to as the activation layer, this converts the result of the preceding transformations into something for producing the final output.
The input layer converts each word (or token) into a vector representation that the rest of the neural network can understand—namely, the word embedding we discussed in the previous chapter! Frequently, people will use an existing set of pretrained word embeddings here, such as Word2Vec, GloVe, or ELMo.
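As a rough sketch of what the input layer amounts to, here is a lookup into an embedding matrix for a hypothetical four-word vocabulary. The randomly initialized vectors stand in for a pretrained or learned embedding matrix; the names and sizes are placeholders, not any real pretrained set.

```python
import numpy as np

vocab = {"bats": 0, "range": 1, "in": 2, "size": 3}   # hypothetical vocabulary
embedding_dim = 4
rng = np.random.default_rng(0)

# In practice these rows would come from pretraining (e.g., Word2Vec or GloVe) or be learned during training.
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

def input_layer(tokens):
    """Convert a list of tokens into their embedding vectors."""
    return np.stack([embedding_matrix[vocab[t]] for t in tokens])

x = input_layer(["bats", "range", "in"])
print(x.shape)  # (3, 4): three tokens, each represented by a 4-dimensional vector
```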
We will simplify matters in this chapter by considering only one hidden layer. In reality, deep neural networks are so called because they stack multiple hidden layers, with different weights learned in each layer. In this step, the network applies a set of weights to the embeddings. The values of these weights are the parameters of the model. This model is trained by consuming a snippet of text as input (“Bats range in”), comparing its generated output to the true value in the text (“size”), and adjusting the weights in its hidden layer depending on how close it got. This process is repeated many times over, slightly altering the state of the hidden layer each time.4 Hopefully, you are now gaining some intuition about just how much text such a model needs to consume, and how many parameters it needs to tweak, in order to construct a decent-quality representation!
Having applied the transformation steps in its hidden layer, the neural network has done most of its work. But the information cannot be directly used for anything in its current form in the hidden layer. The output layer asks the question of “so what?” It does so using an activation function—again, terminology borrowed from biology—whose job is to aggregate the information in the matrices into some kind of transmissible output. To be precise, activation functions are also the mechanism by which one hidden layer passes information forward to the next in a deep neural network. For simplicity, we will bypass that part of the discussion and focus only on the activation function in the output layer.5
For our task, we need to end up with a probability distribution across all possible words that might follow “Bats range in,” just as we have in the probabilities table for the trigram language model in Figure 2-1. A probability distribution is a finite set of numbers between 0 and 1 that sum to 1. Therefore, a particular mathematical function is applied (e.g., softmax) to yield the probability distribution.
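Putting the layers together, here is a deliberately tiny forward pass for a fixed-window next-word predictor: embeddings in, one hidden transformation, then a softmax over a toy vocabulary. Every dimension and weight value is an arbitrary placeholder; the sketch shows the shape of the computation, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["size", "weight", "length", "color"]                      # tiny hypothetical vocabulary
embedding_dim, hidden_dim, window = 4, 8, 3

W_hidden = rng.normal(size=(window * embedding_dim, hidden_dim))   # hidden-layer weights (the parameters)
W_out = rng.normal(size=(hidden_dim, len(vocab)))                  # output-layer weights

def softmax(z):
    z = z - z.max()                       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(window_embeddings):
    x = window_embeddings.reshape(-1)     # concatenate the window into one input vector
    h = np.tanh(x @ W_hidden)             # hidden layer: weights plus a nonlinearity
    return dict(zip(vocab, softmax(h @ W_out)))   # output layer: probability distribution over the vocabulary

print(predict(rng.normal(size=(window, embedding_dim))))
```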
What is “inside” an artificial neural network? Matrices. Figure 2-3 shows a simplified example.
The values in the activation and aggregation layer are comparable to the values in the probabilities table of the trigram language model (Figure 2-1) shown earlier. The operations performed on the matrices are implied by the arrows. Thus, for instance, when we speak of weights being applied to the embeddings, this is shorthand for an operation being performed on the matrices in those two layers.
What is a neural network made of? Matrices, and operations defined to be performed on the matrices.
It is important to understand that this neural architecture is not limited to creating an LM—a system for predicting the next word. Other tasks that this neural architecture can take on include part-of-speech tagging, named entity recognition, sentence classification (e.g., for sentiment), and so on, not to mention the myriad of tasks for image recognition! Applying the architecture to such tasks means altering the functionality of the output layer and the expected output (and for image tasks, the input as well, of course). We are not going to explore these other aspects in this report (which is specifically about language modeling) but wanted to avoid any potential confusion on this point.
We have up to now discussed a feed-forward, fixed-width neural network (so described because it feeds in previous input to predict the next token, or word, forward in the sequence, and the size of the input is a fixed number of tokens). Returning to our example sentence, the neural network would train on it by “sliding” along, pairing consecutive trigram inputs and unigram outputs:
“Bats range in,” “size”
“range in size,” “from”
“in size from,” “the”
And so on. Although this relieves some of the major problems with probabilistic n-gram models, such as those of sparsity and storage, using a feed-forward, fixed-width neural network for language modeling has some serious drawbacks.6 Words that appear close together will affect the network together, but relationships between words that appear further apart will not be detected.
In our example sentence, the network will forget all about “the Kitti’s hog-nosed bat” by the time it reaches “weighs less than a penny” because of the intervening parenthetical “(also called the bumblebee bat).” In other words, even if we widened its fixed-width window to say, five tokens, the model would have learned exactly the same information from the input “weighs less than a penny” whether it was in our sentence about bat sizes or in a sentence such as “That whole pile of feathers still weighs less than a penny,” although the human reader will likely think about the weights of these things differently. The contextual information that surrounds the text input is completely ignored by the neural network.
Imagine a little robot bat that has been outfitted with an artificial neural network for learning about (and surviving in) its environment. Using this neural network would be like giving our bat the proverbial memory of a goldfish. Every five seconds, it would completely reevaluate its environment and decide where to look for food, how to avoid predators, and so on. It would take cues only from what it could sense in that time period, and then immediately forget everything (as we did in the baking of Cake 0). The bat might feel full but have no memory of eating; it might be flying extra fast but not remember that it is being actively pursued. Our poor bat would really benefit from having access to information beyond the current moment.
So in the rest of this chapter, we cover how to give neural networks (and our robot bat) a better memory. We promise the rest of the discussion will go more quickly!
When training an artificial neural network, all we tell it each time is, “Here is what is happening now.” What if we could add, “and here is what you just learned” to the input? We can. Meet Recurrent Neural Networks (RNNs), illustrated in Figure 2-4.
Compared to an artificial neural network, the main difference in the structure of an RNN is that it takes in two inputs as it trains: the current token, and the hidden layer (roughly, the values of all the parameters) after consuming the previous token. By comparing Figure 2-4 with Figure 2-2, you can see how RNNs are trained by using sequences of inputs, rather than single inputs.
The RNN processes one token at a time, and hence we speak of iterations over the sequence, or time steps. In our example, we have three time steps. At time step 0, the token “bats” is passed to the hidden layer, which is altered to reflect the consumption of the input. At the next iteration, time step 1, both the token “range” and the hidden layer’s state at the end of the previous time step are passed as input. At time step 2, both the token “in” and the hidden layer’s state at the end of time step 1 are passed as input. The token sequence has now ended, and the hidden layer’s state at the end of time step 2 is passed through the output layer, as shown in Figure 2-5.
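In code, the recurrence is simply a loop in which the hidden state produced at one time step is fed back in at the next. As before, the weights below are random placeholders rather than anything a model has learned.

```python
import numpy as np

rng = np.random.default_rng(2)
embedding_dim, hidden_dim = 4, 8
W_x = rng.normal(size=(embedding_dim, hidden_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))      # hidden-to-hidden (recurrent) weights

def rnn(sequence_embeddings):
    h = np.zeros(hidden_dim)                         # hidden state before any input is seen
    for x_t in sequence_embeddings:                  # one iteration = one time step
        h = np.tanh(x_t @ W_x + h @ W_h)             # current token plus the previous hidden state
    return h                                         # state after the last token, handed to the output layer

tokens = rng.normal(size=(3, embedding_dim))         # stand-ins for the embeddings of "bats", "range", "in"
print(rnn(tokens).shape)                             # (8,)
```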
Note that although we are currently interested in (and showing) the output of only the last time step, every time step is fully ready to output its best guess for the next word. Another way to look at it is that the RNN is producing an output sequence that is exactly the same size as the input sequence—we are just ignoring it. This fact will become important in the next chapter.
RNNs always produce an output sequence that is exactly the same size as the input sequence.
One aspect that (hopefully) is immediately obvious is that RNNs can process input sequences of any length, and there is no need to specify a window size, which is certainly an advantage. In theory, an RNN could just keep propagating its hidden layer information forever!
Ah, if only it were that simple. In practice, it turns out that an RNN is not successful at accessing information from too many steps back. Why? Well, the size of the model (the size of the hidden layer) remains exactly the same, no matter how far along in the input sequence the model is. In Figure 2-5, the shaded rectangles representing the hidden layer are not growing in size, even though the model is having to remember information about an increasing amount of text. This means that, returning to our bat sizes, if we allow sequences to be as long as sentences, we are limited to the same structure for predicting the next word after “Bats range in,” as for “Bats range in size…up to 6 feet” (43 tokens). Such computation also turns out to be quite slow, for the same reason: at each training step, the network has to adjust its weights not just once, but over and over as it travels “back in time” to the beginning of the sequence.7
We have now upgraded our bat to an RNN and allowed it to remember things a bit further back, such as that it was recently eating or in the middle of escaping pursuit. But it still can’t remember that it visited a promising feeding ground three nights ago, and to check back, because that information has been crowded out by all the minutiae of everything that came after. Similarly, it is liable to get spooked all out of proportion by a single bad encounter and never return to a good feeding spot again. How can we help our bat learn to remember the important things and forget the unimportant ones?
Long Short-Term Memory (LSTM) networks start with an RNN architecture and add a memory state that persists over time (for as long as the network is processing the input sequence). At each time step, the hidden layer both accesses and alters the information in the memory state. Figure 2-6 shows a simplified view of an LSTM network.
The interactions between the hidden layer and the memory state are called gates because they serve as gatekeepers of information, deciding what shall pass and what shall be denied passage. The gates are concerned with two tasks: determining which information to forget from the previous time steps and determining which information to remember going forward. Without the ability to forget, the model suffers from a sort of hyperthymesia.8 Recall the earlier discussion of what an LM learns: it seeks to be as effective as possible in its “task” that for input that is like this, it should produce desired output that is like that. Not all the information in the model is going to be relevant to the specific task, and therefore that information would be doing nothing but taking up space and computing time. The ability to forget is a significant aspect of memory, for both humans and LMs.
As each input is processed at each time step, the hidden layer is still working on the predict-the-next-word problem, and the gates are choosing which subsets of information to retain. Without getting into any technical detail (which is not our goal), suffice it to say that these mechanisms are also neural networks, and they get trained in the same way. Given an input sequence, the LSTM unit simultaneously learns how to predict the next word and how to remember important parts of the sequence and forget unimportant ones.
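The sketch below shows one simplified LSTM-style step, only to make the roles of the forget and input gates concrete. It leaves out the output gate and all training machinery, and its weights are placeholders; it illustrates the gating idea rather than reproducing any library's LSTM implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One simplified step: gates decide what to forget from memory and what new information to store."""
    W_f, W_i, W_c = params                                   # placeholder weight matrices for the gates
    combined = np.concatenate([x_t, h_prev])
    forget_gate = sigmoid(combined @ W_f)                    # values near 0 erase memory, near 1 keep it
    input_gate = sigmoid(combined @ W_i)                     # which parts of the new candidate to store
    candidate = np.tanh(combined @ W_c)                      # proposed new memory content
    c_new = forget_gate * c_prev + input_gate * candidate    # updated memory state
    h_new = np.tanh(c_new)                                   # simplified: a full LSTM also applies an output gate
    return h_new, c_new

rng = np.random.default_rng(3)
emb, hid = 4, 8
params = [rng.normal(size=(emb + hid, hid)) for _ in range(3)]
h, c = np.zeros(hid), np.zeros(hid)
for x_t in rng.normal(size=(3, emb)):                        # one step per token
    h, c = lstm_step(x_t, h, c, params)
```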
The ability to forget is a significant aspect of memory, for both humans and LMs.
Finally, our bat has an LSTM. It can remember important things, like a particularly fertile feeding ground and good places to hide from predators. The bat will also not get distracted by trifles, such as the location of plants it can’t eat or the exact behavior of every other bat in its colony.
We have spent this chapter presenting both RNNs and LSTMs as unidirectional, processing text left to right. However, you may have heard of (or are now thinking about) bidirectional RNNs. Information in text can often be presented out of order, but that doesn’t cause problems for people. A person will get pretty much the same understanding from the following two sentences:
This is because our brains easily flit (like bats!) back and forth, and incorporate the information we just encountered without having to reread it. But the RNN, which moves resolutely left to right, does not flit.9 Roughly, a bidirectional RNN (or a variant thereof, such as an LSTM) adds a path going right to left for the hidden layer, simultaneously working on the sequence of the original text and the sequence of the reversed text.
An immediate practical consideration is that a bidirectional RNN will work only for tasks for which you have access to the full text! Good examples of this are classification and summarization. In a text-generation task, the model is creating the sequence as it moves along, and therefore clearly cannot take in the reverse of this sequence, which is not yet completed. However, when a bidirectional model can apply, using it is frequently a good idea as it improves performance (and it is still far less computationally demanding than, for instance, a Transformer-based model, described later in the report).
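Conceptually, a bidirectional recurrent layer is just two recurrent passes whose per-token states are combined. Here is a minimal sketch, with the simplifying assumption that both directions share a single set of placeholder weights (a real implementation would learn separate weights for each direction).

```python
import numpy as np

rng = np.random.default_rng(4)
emb, hid = 4, 8
W_x, W_h = rng.normal(size=(emb, hid)), rng.normal(size=(hid, hid))

def run(sequence):
    """Run a simple RNN over the sequence and return the hidden state at every step."""
    h, states = np.zeros(hid), []
    for x_t in sequence:
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

sequence = rng.normal(size=(5, emb))                 # stand-in embeddings for a 5-token input
forward = run(sequence)                              # left-to-right pass
backward = run(sequence[::-1])[::-1]                 # right-to-left pass, re-aligned to token order
bidirectional = np.concatenate([forward, backward], axis=1)
print(bidirectional.shape)                           # (5, 16): each token sees context from both directions
```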
Nothing about RNNs, or neural networks in general, constrains their applicability to only language models. Neural networks are developed for all sorts of tasks beyond modeling language, and all sorts of data beyond text, such as image, audio, and many others. In this report, we are concerned with the use of neural network architecture only as language models, but the two are in no way equivalent.
Not all neural networks are language models, and not all language models are neural networks.
Language models with RNN-like architectures can be used for a variety of natural language processing (NLP) tasks, such as classification and text generation. We say RNN-like architectures, because we have used LSTMs to illustrate just one variation on the basic architecture of RNNs, and in practice multiple variations exist. While they differ in mathematical structure, the main intuition is the same: an RNN enhanced with neural techniques to maintain a memory state. We will not get into further detail here, and encourage you merely to be aware of the existence of variations and to seek out additional concrete information as needed.
此外,我们无需构建词级神经语言模型。我们可以轻松地构建(就像人类一样)字符级语言模型,它同样可以预测序列中的下一个字符。正如我们所提到的,这种神经架构可以承担的其他任务包括词性标注、命名实体识别、句子分类(例如,用于情感分析)、图像识别和时间序列预测。
Furthermore, nothing requires us to build word-level neural language models. We can just as easily build (as people have) character-level language models, which similarly predict the next character in a sequence. And as we've mentioned, other tasks that this neural architecture can take on include part-of-speech tagging, named entity recognition, sentence classification (e.g., for sentiment), image recognition, and time-series prediction.
最后,回想一下我们之前提到的关于拥有多个隐藏层的可能性,将它们堆叠起来,使得一个隐藏层也作为下一个隐藏层的输入(通过激活函数)。实践中,我们已经注意到堆叠几层(大约两到四层)可以提升性能,但堆叠太多层会引发一个我们之前讨论过的常见问题:信息开始消失。这应该会为本报告后面讨论 Transformer 架构时的一个主题提供一些提示:网络中现在有更多连接,而且是在层与层之间!
Finally, recall our earlier aside on the potential of having more than one hidden layer, stacking them so that one hidden layer also serves as the input to the next hidden layer (via an activation function). In practice, it has been noticed that stacking a few layers (somewhere in the two-to-four range) can improve performance, but stacking too many layers introduces a version of a familiar problem, as we’ve discussed: information starts to disappear. This should give a hint as to one topic to come later in the report, when we discuss Transformer architecture: more connections in the network, now between layers!
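As a hedged, purely illustrative sketch of what that stacking looks like in practice, PyTorch's built-in LSTM can hold several layers at once; three is just one choice within the two-to-four range mentioned above:

```python
import torch
import torch.nn as nn

# Three stacked LSTM layers; the hidden states of each layer serve as the inputs to the layer above it.
stacked_lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=3, batch_first=True)

x = torch.randn(1, 12, 64)              # one sequence of 12 token embeddings
outputs, (h_n, c_n) = stacked_lstm(x)

print(outputs.shape)  # torch.Size([1, 12, 128]) - the top layer's state at every step
print(h_n.shape)      # torch.Size([3, 1, 128])  - the final hidden state of each of the 3 layers
```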
本章介绍了如何使用神经网络构建语言模型,并重点介绍了两种用于文本处理的架构:RNN 和 LSTM。在继续讨论更复杂的架构之前,我们想先总结以下几个关键点:
This chapter has introduced how neural networks can be used to construct language models and then focused particularly on two architectures for text processing: RNNs and LSTMs. Before moving on to discuss more complex architectures, we would like to leave you with the following key takeaways:
1我希望我不是唯一一个承认我确实曾不小心把东西烧着了、误读了食谱、并且在替换配料时考虑不周的人。
1 I hope I’m not alone in confessing that I have definitely accidentally set things on fire, misread recipes, and engaged in poorly thought-out ingredient substitutions.
2我们故意添加了这个步骤零,因为它没有明确提及,并且从技术上讲,根据我们所说的,如果没有这个步骤,我们就会加热烤箱并完成……
2 We intentionally added this step zero, as it was not explicitly mentioned and, technically, based on what we said, if this wasn’t here, we’d heat up the oven and be done…
3向科幻小说作家菲利普·K·迪克(《仿生人会梦见电子羊吗?》的作者)致歉,感谢他为章节标题所做的贡献。
3 Apologies to science-fiction author Philip K. Dick, who wrote Do Androids Dream of Electric Sheep? for the section heading.
4在这里,更技术性的讨论将涉及梯度和反向传播等术语,以数学方式描述训练期间发生的情况。
4 Here is where a more technical discussion would get into terms like gradients and backpropagation, to mathematically describe what is going on during the training.
5为了更深入地了解技术,您可以阅读输出层中常用的 softmax 函数、隐藏层之间的 ReLU 等;例如,参见Karthiek Reddy Bokka 等人撰写的《自然语言处理的深度学习》 (O'Reilly)。
5 For a deeper technical understanding, you might read about the commonly used softmax function in the output layer, and ReLU between hidden layers, among others; for example, see Deep Learning for Natural Language Processing by Karthiek Reddy Bokka et al. (O’Reilly).
6有关概率n -gram 模型的更多阅读材料,请参阅Daniel Jurafsky 和 James H. Martin 合著的《N-gram 语言模型》 。
6 For additional reading on probabilistic n-gram models, see, for example, “N-gram Language Models” by Daniel Jurafsky and James H. Martin.
7现在,技术讨论可能会检查“消失”或“爆炸”的梯度,这大致指的是由于太多步骤而丢失信息,或者由于序列中某个步骤被赋予了过多的重要性而传播错误。
7 Technical discussions might now examine gradients that either “vanish” or “explode,” which refers roughly to losing information from too many steps back, or propagating errors from one step in the sequence that is given too much importance, respectively.
8 超忆症是一种极其罕见的疾病,患者能够清晰地记住大量生活经历的细节。
8 Hyperthymesia is an extremely rare condition that allows people to be able to remember an abnormally large number of their life experiences in vivid detail.
9当然,可以将 RNN 设置为从右向左移动以适应另一种语言,例如希伯来语,但标准 RNN 需要注意的一点是它的单向性。
9 An RNN can, of course, be set up to move right to left for another language, like Hebrew, but the point to note about a standard RNN is its unidirectionality.
我们已经详细讨论了神经语言模型在预测下一个单词方面的视角,但并未深入探讨为什么这对于模型学习来说如此重要。让我们选择一个更具体、更实际的任务来激发我们对神经网络架构的下一步探索:文本摘要。我们首先思考人类是如何思考文本摘要的。然后,我们将使用相同的示例来介绍编码器-解码器架构,也称为序列到序列(seq2seq)。最后,我们将讨论一些关于编码器-解码器使用的注意事项,并总结本文的关键要点。
We have now discussed in detail the neural language model perspective on predicting the next word, without delving too much into why this is such a valuable thing for models to learn to do. Let’s pick a more concrete and practical task to motivate our next exploration of neural network architecture: text summarization. We begin by considering how humans think about summarizing text. The same working example is then used to introduce the Encoder-Decoder architecture, which is also referred to as sequence-to-sequence (seq2seq). Finally, we offer a few considerations on the use of encoder-decoders and finish with the key takeaways.
我们首先来思考一下人类是如何完成文本摘要任务的。我们将用“你画我猜”游戏来更具体的例子。你们两人一组玩“你画我猜”游戏:你随机选择一个你的同伴没见过的单词,然后你必须画出这个单词的图形,让你的同伴猜对。图 3-1展示了游戏中单词“bat”步骤的一个简单示例。
Let’s begin by considering how humans approach the task of summarizing text. We will make the example more concrete by using the game of Pictionary. You play Pictionary in a team of two: you randomly select a word that your partner does not see, and you must then draw a picture of the word to enable your partner to guess the word correctly. Figure 3-1 shows a simple example of the steps in the game with the word “bat.”
常规的“你画我猜”游戏有时间限制:你的队伍有有限的时间进行绘画和猜测,如果你的同伴猜错,你的队伍就不会得分。但让我们稍微修改一下游戏,创建一个“你画我猜”的总结版。
Regular Pictionary has a time component: your team has a limited time for drawing and guessing, and if your partner fails to guess correctly, you do not pick up a point for your team. But let’s amend the game slightly and create a Summarization version of Pictionary.
现在,给你一段较长的文本(而不是一个单词),你必须将其转换成一幅图画,然后你的队友必须根据这段文字写出一个简短的总结句子——比如说,不超过10个单词。你会在你的图画里写些什么呢?我们来看图3-2中的例子。
Now, you are given a longer chunk of text (instead of a word), which you must transform into a drawing, and your teammate must then generate a short summary sentence from it—say, no more than 10 words long. What would you put in your drawing? Let’s take the example shown in Figure 3-2.
逐步思考你自己的思维过程。记住你的任务:将一张图片解读成一个10字的摘要。你需要尽可能多地向负责撰写摘要的队友传达信息,同时确保队友了解最重要的信息(当然,这对你和队友来说都是主观的)。
Consider your own thought process step by step. You keep the task in mind: interpreting an image into a 10-word summary. You want to communicate as much information as possible to your teammate, who will create the summary, while also making sure your teammate is aware of the most important information (this will, of course, be subjective on both your part and theirs).
在阅读和绘画的过程中,你会判断哪些信息重要且必须传达:必须有一只蝙蝠和某种沙漠植物,并且蝙蝠看起来应该像正在以植物的花朵为食,以暗示其授粉和吸食花蜜。另一方面,除非你和你的队友是世界级的蝙蝠专家,否则你会决定忽略它是墨西哥长舌蝙蝠的事实,这并非因为你们不知道如何具体地画出这个物种。你决定不把这些信息写进你为队友画的画里。总而言之,你正在阅读文本,指出哪些部分很重要,哪些部分可以忽略,然后完成你的画作。
As you read and draw, you make judgment calls about which information is important and definitely needs to be passed along: there must be a bat, and some sort of desert plant, and the bat should look like it is feeding on the flowers of the plant to suggest both pollination and eating nectar. On the other hand, unless you and your teammate are world experts on bats, you decide to ignore the fact that it's a Mexican long-tongued bat, not least because you have no idea how to draw that particular species. That is information that you decide will not make it into your drawing for your teammate. To summarize, you are reading the text, noting some portions as important and others as safe to ignore, and finalizing your drawing.
前面那句话本意是想引导一下,希望能让你想起 RNN 或 LSTM 架构!LSTM 也接受文本序列,选择性地记住和忘记,并细化隐藏表征(比如你的绘画)。但请记住,在每个时间步,LSTM 都会从左到右处理文本,并输出对下一个单词的最佳猜测。因此,在某种程度上,它可以为给定的文本输入序列生成文本输出序列,但存在两个问题:
That preceding sentence was meant to be a bit leading, and, hopefully, to remind you of an RNN or LSTM architecture! An LSTM also takes in a sequence of text, selectively remembers and forgets, and refines a hidden representation (like your drawing). But remember that at each time step, the LSTM is processing the text from left to right and outputs its best guess for the next word. So in some way, it can produce an output sequence of text for a given input sequence of text, but with two problems: first, the size of the output sequence is tied to the size of the input sequence (one best guess per input step); and second, the text is processed in only one direction, left to right.
第二个问题可以通过使用双向架构来解决,正如上一章所述。但这并不能缓解第一个问题:对于摘要任务(或许多其他任务,例如翻译)来说,将输出序列的大小与输入序列的大小绑定是没有意义的。
The second problem can be addressed by using a bidirectional architecture, as mentioned in the previous chapter. But this in no way alleviates the first problem: tying the size of the output sequence to that of the input sequence does not make sense for a summarization task (or plenty of other tasks, such as translation).
对于许多任务(例如摘要或翻译)来说,将输出序列的大小限制为等于输入序列的大小是没有意义的。
Constraining the size of the output sequence to be equal to that of the input sequence does not make sense for many tasks, such as summarization or translation.
你的同伴是如何创作出所需的总结句子的?他们观察了你的画作,并生成了一个符合语法的句子,其中包含了他们认为你选择表达的显著信息。就像一个人可以将另一个人的画作转换成文本一样,当一个 LSTM 创建了“画作”后,我们可以用第二个 LSTM 将其转换回文本!
How did your partner create the required summary sentence? They looked at your drawing and generated a grammatical sentence that contained the salient information they believed you had chosen to represent. Just as one human can transform another human’s drawing into text, when the “drawing” is created by one LSTM, we can get a second LSTM to transform it back into text!
这就是编码器-解码器架构的要点:一个 LSTM 用于编码文本,并与另一个 LSTM 连接以进行解码。正如上一章所讨论的,这不仅限于 LSTM。任何处理 seq2seq 任务(例如摘要)的神经架构都可以是候选架构,为了简洁起见,我们在本章中仅提及 LSTM。图 3-3展示了这种方法。
This is the main point of Encoder-Decoder architecture: one LSTM to encode the text, linked with a second LSTM to decode it. As discussed in the previous chapter, this isn’t limited to LSTMs. Any neural architecture that is tackling a seq2seq task such as summarization is a candidate here, and we refer only to LSTMs in this chapter for brevity. Figure 3-3 illustrates the approach.
在这个编码器-解码器架构中,一个 LSTM(编码器)接收待摘要的源文本。处理完“cacti”之后,神经网络的隐藏层表示了从输入序列中收集到的所有信息。第二个 LSTM(解码器)随后以编码器隐藏层的最终状态和一个特殊标记(图中标记为 <START>)作为输入进行初始化,以指示它现在应该开始生成输出。
In this Encoder-Decoder architecture, one LSTM (the encoder) consumes the source text to be summarized. The hidden layer of the neural network after processing “cacti” is a representation of all the information gleaned from the input sequence. A second LSTM (the decoder) is then initialized with input of the final state of the hidden layer of the encoder as well as a special token (denoted in the figure as <START>) to indicate that it should now begin to produce output.
接下来,解码器 LSTM 的工作方式与我们在上一章中讨论的一样,但有一个修改:任何时间步的输入都是前一个时间步的隐藏层,以及它自己在前一个时间步的 token 输出,如图 3-3 中的虚线所示。解码器没有其他依据,因此它必须假设其最佳猜测始终是正确的,并将其用作输入以继续生成序列。
Going forward, the decoder LSTM acts just as we discussed in the previous chapter, with one modification: the input at any time step is the hidden layer from the previous time step, and its own token output from the previous time step, as illustrated by the dotted lines in Figure 3-3. The decoder has nothing else to go on, so it has to assume its best guess is always correct, and uses it as input to continue generating the sequence.
一个合理的问题是,解码器何时停止?就像 <START> 标记一样,词汇表中也有一个特殊的 <STOP> 标记,并且解码器从训练示例中学习到,有时对下一个单词的最佳猜测是声明输出序列已完成(因为所有文本输出最终都应该结束)。因此,当解码器运行时,理想情况下,它最终会生成 <STOP> 标记作为其最佳猜测。1或者,您可以直接指定输出序列的长度(例如,在 10 个标记后停止)。
A reasonable question is, when does the decoder stop? Just like the <START> token, a special <STOP> token is in the vocabulary, and the decoder learns from its training examples that sometimes the best guess for the next word is to declare the output sequence completed (because all text output should eventually come to an end). Therefore, when the decoder runs, it will, ideally, eventually generate the <STOP> token as its best guess.1 Alternatively, you might specify the length of the output sequence directly (e.g., stop after 10 tokens).
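Putting the pieces together, here is a minimal, untrained sketch of that loop. It is only meant to show the wiring; the token ids for <START> and <STOP>, the sizes, and the greedy decoding are illustrative assumptions rather than anything prescribed by the report:

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 32, 64
START, STOP = 1, 2                      # assumed ids for the special tokens

embed    = nn.Embedding(VOCAB, EMB)
encoder  = nn.LSTM(EMB, HID, batch_first=True)
decoder  = nn.LSTM(EMB, HID, batch_first=True)
to_vocab = nn.Linear(HID, VOCAB)

def summarize(source_ids, max_len=10):
    # The encoder consumes the whole source text; its final state seeds the decoder.
    _, state = encoder(embed(source_ids))
    token, output = torch.tensor([[START]]), []
    for _ in range(max_len):
        # The decoder's input is its previous state plus its own previous best guess.
        step, state = decoder(embed(token), state)
        token = to_vocab(step).argmax(dim=-1)     # greedy best guess for the next word
        if token.item() == STOP:                  # the decoder declares the output finished
            break
        output.append(token.item())
    return output

print(summarize(torch.randint(3, VOCAB, (1, 15))))  # untrained weights, so the "summary" is noise
```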
本章我们讨论了一个简化的、受 Pictionary 启发的摘要任务。文本摘要通常更实际地应用于大量文本,例如文档甚至文档集合。事实上,编码器-解码器架构可以应用于许多其他任务,只要它们可以定义为序列到序列。示例包括(但不限于)聊天机器人对话、机器翻译和问答。
We have spent this chapter discussing a simplified, Pictionary-inspired summarization task. Text summarization is usually more practically deployed on much greater volumes of text, such as documents or even document collections. In fact, Encoder-Decoder architecture can be applied to plenty of other tasks, as long as they can be defined as sequence-to-sequence. Examples include (but are not limited to) chatbot conversations, machine translation, and question answering.
更进一步说,输入序列和输出序列的信息媒介不必相同。以自动图像字幕任务为例。在这里,编码器-解码器架构将使用卷积神经网络对图像进行编码,2并使用循环神经网络将结果状态解码为生成的单词序列。3
Going further, the medium of information is not required to be the same for the input sequence as for the output sequence. Consider the task of automatic image captioning. Here, the Encoder-Decoder architecture would use a Convolutional Neural Network for encoding the image,2 and a Recurrent Neural Network for decoding the resulting state into a generated sequence of words.3
我们想对示例中使用的文本摘要任务提出一个重要的说明。这种类型的摘要称为抽象文本摘要( ATS ),因为摘要是使用生成模型创建的。现有的另一种文本摘要称为提取文本摘要( ETS ),它直接从原文中选择单词、短语或句子来构建摘要。ETS 使用截然不同的技术,创建文本的中间表示以识别最重要的片段。它不涉及任何自然语言生成。创建的摘要将忠实于原文,但可能非常不符合语法。另一方面,ATS 更有可能构建符合语法的摘要,但不保证其真实性。对于我们的示例文本,一个事实上不正确(并且有点极端)但语法上完全正确的摘要可能是:“有些舌头吃仙人掌中的蜜源植物。”
We would like to make one important comment on the text-summarization task used in our example. This type of summarization is called abstractive text summarization (ATS) because the summary is created by using a generative model. The other type of text summarization that exists is called extractive text summarization (ETS), which directly selects words, phrases, or sentences from the original text to construct the summary. ETS uses quite different techniques, creating an intermediate representation of the text to identify the most important snippets. It does not involve any natural language generation. The created summary will remain faithful to the original text but may be quite ungrammatical. ATS, on the other hand, is far more likely to construct a grammatical summary but provides no guarantee of veracity. A factually incorrect (and somewhat extreme, to make the point) but perfectly grammatical summary of our running example text might be, “Some tongues eat cacti from nectar plants.”
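To make the contrast concrete, here is a toy sketch of the extractive flavor: score each sentence by how frequent its words are across the whole text and copy the top-scoring sentence verbatim. Real ETS systems are considerably more sophisticated; the point here is only that no language generation is involved.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(counts[w] for w in words) / len(words)
    # Copy the highest-scoring sentences straight out of the original text.
    return ". ".join(sorted(sentences, key=score, reverse=True)[:n_sentences]) + "."

text = ("The bat feeds on the nectar of desert cacti. "
        "The bat pollinates the flowers as it feeds. "
        "It hides from predators during the day.")
print(extractive_summary(text))
```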
有趣的是,我们可以结合提取式和抽象式摘要的技术来改善结果。其中一种方法是指针生成网络,如果您感兴趣,我们诚邀您阅读相关内容(但请先读完本报告,因为该技术涉及注意力机制,将在下一章介绍)。
Interestingly, it is possible to combine techniques for extractive and abstractive summarization to improve matters. One approach is pointer-generator networks, and we invite you to read up on them if you are interested (but please finish this report first, as that technique involves an attention mechanism, which is introduced in the next chapter).
文本摘要主要有两种类型。抽象文本摘要使用生成模型创建摘要。提取文本摘要通过从输入文本中复制最重要的片段来创建摘要。
There are two main types of text summarization. Abstractive text summarization creates the summary by using a generative model. Extractive text summarization creates the summary by copying the most important snippets from the input text.
现在,我们通过引入编码器-解码器,构建了神经网络的基础。在继续探讨语言建模神经架构的下一组创新之前,我们想总结一下:
We have now built on the basics of neural networks by introducing encoder-decoders. Before we continue with the next set of innovations in neural architecture for language modeling, we would like to summarize:
1这并不是说 <STOP> 一定会出现。解码器可能会出现各种不良行为,比如无限循环一个或多个单词。
1 This is not to say that <STOP> is always guaranteed to show up. The decoder might dissolve into all sorts of bad behavior, such as generating a loop of one or more words ad infinitum.
2就像循环神经网络是一类特别适合处理文本的神经网络一样,卷积神经网络是一类非常适合处理图像的神经网络。
2 Just as Recurrent Neural Networks are a class of neural networks particularly suited to processing text, Convolutional Neural Networks are a class well suited to processing images.
3你能反过来吗?当然!文本到图像的生成,顾名思义,就是:输入文本序列,输出图像(像素序列)。
3 Can you go the other way? Sure! Text-to-image generation is, likewise, exactly what it sounds like: a sequence of text as input and an image (a sequence of pixels) as output.
在本章中,我们将探讨一个虚构的例子,即在英语和一种奇幻语言之间进行翻译,以传达围绕注意力机制的某些基本概念。然后,我们将用更专业的术语讨论注意力机制是如何实现的,并进一步从 Transformer 架构的角度进行推断和分析。本节将应用这些架构的任务是机器翻译。我们将首先观察人类的翻译方法。在探讨了关键概念之后,我们将从技术角度讨论注意力机制和 Transformer 架构的工作原理。最后,我们将讨论这些模型的关键考量因素、它们作为解决方案的适用场景以及本章的要点。
In this chapter, we explore a fictional example of translating between English and a fantasy language to convey the underlying concepts surrounding attention mechanisms. The task these architectures are applied to here is machine translation, so we begin with some observations about the human approach to translation. Once the key concepts have been explored, we discuss in more technical terms how attention is achieved and then view it through the lens of the Transformer architecture. Lastly, we discuss key considerations for these models, where they are applicable as a solution, and the key takeaways from the chapter.
想象一下,你正乘坐邮轮前往巴哈马群岛。你以为已经抵达目的地,船长却用英语告诉你,船似乎在百慕大三角区迷路了,你抵达的是一个未知的岛屿。不幸的是,邮轮燃料不足,无法返回航线,船长请你去和岛民谈谈。
Imagine that you are on a cruise to the Bahamas. Upon what you believe to be your arrival, the ship’s captain informs you, in English, that it seems that the ship has gotten lost inside the Bermuda Triangle, and you have instead arrived at an uncharted island. Unfortunately, the cruise vessel does not have enough fuel to return to course, so the captain asks you to go and speak with the islanders.
听到他们的母语,你却无法辨别出任何你熟悉的语言的相似之处。于是你想出了一个妙招:不如用手指或模仿各种物体和动作,你和岛民一起在沙滩上写下相应的单词。对你来说,他们的书面语言看起来就像一堆以不同形状排列的三角形和圆圈,但它们显然有着相同的词汇和结构概念。
Hearing their native language, you are unable to recognize any similarity to a language you are familiar with. So you come up with a clever idea: why not point to, or mime, various objects and actions, while both you and the islanders write the corresponding words in the sand? To you, their written language looks like a bunch of triangles and circles arranged in different configurations, but it clearly has the same concept of words and structure.
你指着自己,写下你的名字,然后念出来;岛民也会这样做。你指着船,写下“船”,等等。最终,你把他们的语言映射到几个基本的英语单词上,尽管你自己对如何说、发音或表达他们的语言知之甚少,而且在这个过程中很可能发生了一些误解。你开始尝试把“船需要加油才能离开”这句话翻译成他们的语言。经过反复沟通,你最终表达出了你必须给船加油的基本想法,他们也欣然接受了。
You point at yourself, write your name, and say it; an islander does the same. You point at the boat and write “boat,” and so on. Eventually, you have a mapping of their language to several basic English words, though you have little understanding of how to say, pronounce, or produce their language yourself, and some misunderstandings likely occurred during the process. You get to work trying to translate the English sentence, “The boat needs fuel so it can leave,” into their language. After much back and forth, you communicate the basic thought that you must refuel the boat, to which they happily oblige.
当你终于回到家时,你找到了一本他们语言的字典,并查了一下最终让你回家的那句话:“水龙想要开花;它需要食物。” 1 这不是你的意思,但它达到了预期的效果。你继续翻阅字典,意识到了几件事:
When you finally arrive back home, you are able to find a dictionary with their language and check what was said to ultimately get you back home: “Water dragon wants to bloom; it needs food.”1 This is not what you meant, but it had the intended effect. You keep looking through the dictionary and realize a couple of things:
图 4-1非常简单地说明了这一点。当将“船需要燃料”翻译成“水龙需要食物”时,很容易意识到哪些词是对齐的。
Figure 4-1 illustrates this point extremely simply. When translating “The boat needs fuel” to “Water dragon needs food,” it is easy to realize which words are aligned for the translation.
总而言之,你与岛民的翻译经历有几个值得注意的方面。首先,你每次只专注于翻译文本的一部分,以免无关信息干扰沟通渠道。其次,你使用了一种合理的翻译变体:传达“燃料”最简单的方式是模仿自己吃东西的动作,这是所有生物都会做的事情,所以对于翻译来说,谁在吃东西并不重要,是你还是船。最后,你使用了另一种变体,这种变体只在一种语言中说得通,但在另一种语言中却变得混乱(“离开”的双重含义);尽管如此,最终的整体翻译还是成功的。
To summarize, your translation encounter with the islanders had a few notable aspects. First, you focused on only a subset of the text to be translated at a time, so as not to muddle the communication channel with irrelevant information. Second, you used a variation in translation that makes sense: the easiest way to communicate “fuel” was to mime yourself eating, something all creatures do, and so for the translation it did not even matter who was doing the eating, you or the boat. Finally, you used another variation that made sense in only one language but became muddled in the other (the double meaning of “leave”); still, the overall final translation managed to be successful enough.
现在,让我们从更专业的术语来讨论如何实现注意力机制,特别是从 Transformer 架构的角度来探讨。注意力机制的动机,同样是为了模仿人类的行为,以及我们在处理特定词语时只关注相关语境的能力。如何在神经网络中实现这一点呢?让我们回顾上一章的编码器-解码器架构。
Let’s now discuss in more technical terms how attention is achieved, specifically through the lens of the Transformer architecture. The motivation for attention is, once again, in the desire to mimic human behavior and our ability to focus on only the relevant context when processing any particular word. How can this be implemented in a neural network? Let’s go back to our Encoder-Decoder architecture in the preceding chapter.
简而言之,注意力机制是神经网络中的另一层,它允许解码器同时参考编码器产生的最终隐藏层以及编码器原始输入的选定子集。换句话说,注意力机制是相关上下文的过滤器。在训练网络其他部分的同时,注意力层也会被训练来决定哪些内容是相关的。
Simply put, an attention mechanism is another layer in the neural network that allows the decoder to consult with both the final hidden layer, produced by the encoder, and a select subset of the original input to the encoder. In other words, attention is a filter for the relevant context. The attention layer is trained to decide what is relevant at the same time as the rest of the network is being trained.
注意力机制充当着当前输入文本相关上下文的过滤器。它经过训练,可以与网络的其他部分同时判断哪些内容是相关的。
Attention acts as a filter for the relevant context for the current input text. It is trained to decide what is relevant at the same time as the rest of the network.
图 4-2突出显示了与图 3-3中的编码器-解码器体系结构的主要体系结构差异。解码器仍然获取编码器隐藏层的最后状态,但现在增加了来自注意层的输入,这指示解码器在每个时间步骤中查看的位置。当编码器完成时,隐藏层保存了有关船、燃料以及前者对后者的需求的信息。但是要翻译“船”,考虑句子中的任何其他内容都是不相关的,因此这里的注意层会将焦点完全放在输入“船”上,并屏蔽其余输入。图 4-2中的示例非常短,但不难想象,当输入文本很长且很复杂时,这将变得多么有用。
Figure 4-2 highlights the main architectural differences from the Encoder-Decoder architecture in Figure 3-3. The decoder still gets the last state of the hidden layer of the encoder, but it is now augmented with input from the attention layer, which directs the decoder where to look at each of its time steps. By the time the encoder is finished, the hidden layer holds information about the boat, fuel, and the former's need of the latter. But to translate "boat," nothing else in the sentence is relevant, and so the attention layer here would bring the focus entirely onto the input "boat" and block out the rest of the input. The example in Figure 4-2 is extremely short, but it is easy to imagine how useful this becomes when the input text is long and convoluted.
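Stripped of everything else, the heart of such an attention layer can be sketched in a few lines. The shapes and random values below are illustrative; the point is only that relevance scores become weights, and the weights build a context vector that filters the encoder's states:

```python
import torch
import torch.nn.functional as F

encoder_states = torch.randn(4, 64)   # one state per input token, e.g., "The boat needs fuel"
decoder_state  = torch.randn(64)      # the decoder's state at its current time step

scores  = encoder_states @ decoder_state   # one relevance score per input token
weights = F.softmax(scores, dim=0)         # the "filter": weights sum to 1
context = weights @ encoder_states         # a mix of encoder states, emphasizing the relevant ones

print(weights)         # ideally most of the mass would sit on the token for "boat"
print(context.shape)   # torch.Size([64])
```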
我们选择机器翻译作为解释注意力机制的任务,效仿了通过机器翻译向 NLP 社区介绍注意力机制的学术论文。2
We have chosen machine translation as the task through which we explain the attention mechanism, following in the footsteps of the academic paper that introduced attention to the NLP community via machine translation.2
虽然我们使用翻译任务来引入注意力,但它是一种通用技术,可以用于许多架构,不仅限于编码器-解码器,当然,在机器翻译之外的许多任务中也很有用。
Although we used a translation task to introduce attention, it is a general technique that can be used in many architectures, not only Encoder-Decoder, and, of course, is useful in many tasks beyond machine translation.
一个有趣的持续讨论是,注意力机制是否可解释,即能否为模型的行为提供解释。注意力机制确实在输入和输出之间提供了一种“软对齐”。检查解码器每个时间步的注意力分数,可以得到模型识别的相关上下文的特定解释,该解释以分数在输入标记上的分布形式呈现。因此,你可能会认为,通过参考这种注意力分布,就能理解解码器发出该特定标记的决定。然而,正如深度学习的大多数方面一样,事情远没有那么简单!
An interesting ongoing discussion is whether attention is interpretable, in the sense of providing an explanation for the model's behavior. Attention does give a sort of soft alignment between input and output. Examining the attention scores at each time step of the decoder shows a particular interpretation of what the model identified as the relevant context, in the form of a distribution of scores over the input tokens. Therefore, you might decide that, by consulting this attention distribution, you can understand the decision taken by the decoder to emit that particular token. Alas, as with most aspects of deep learning, things are just not that simple!
首先,注意力机制可能非常嘈杂,而且上下文的相对重要性在看似相似的样本中也可能存在差异。有趣的是,尤其是在分类等任务中,不同的注意力分布仍然可以产生相同的最终预测。3事实上,可以直接干预注意力层,强制模型忽略最受关注的标记,同时仍然产生相同的输出。4
First, attention can be quite noisy, and the relative importance of the context can vary across seemingly similar examples. Interestingly, especially with tasks such as classification, different attention distributions can nonetheless yield equivalent final predictions.3 In fact, it’s possible to directly interfere with the attention layer and force the model to ignore the most “attended to” tokens, and yet still yield the same output.4
这些观点应该会强化我们的直觉,让我们不再依赖注意力机制作为指向“负责”模型输出的标记的可靠指针。再说一次,它并没有进行真正的推理,只有联想。注意力机制充其量只能被认为是对模型决策过程进行合理的重构(它本来可以这样),尽管它无法保证其真实性(究竟发生了什么?)。
These points should strengthen our intuition against relying on attention as a robust pointer to tokens that are “responsible for” the output of the model. Once again, no actual reasoning is going on, only associations. At best, attention can be thought of as giving a plausible reconstruction of the model’s decision process (it could have happened this way), though there’s no guarantee of faithfulness (what really happened?).
注意力可以解释吗?答案既有肯定的,也有否定的,而且这个重要的讨论仍在积极进行中。
Is attention interpretable? There are reasons to answer both yes and no, and this important discussion is actively ongoing.
在编码器-解码器架构中添加注意力层无疑会有所改进。但从英语的角度来看,我们仍然面临一个奇怪的限制。没错,我们试图从一个序列转换到另一个序列,但为什么这些序列需要从左到右处理呢?毕竟,如果我们的例句是“船要离开,所以需要燃料”呢?对于翻译的最终含义而言,先讨论离开还是先讨论燃料真的很重要吗?
Adding an attention layer to the Encoder-Decoder architecture definitely improves things. But we still have a strange—from our English-language perspective—constraint. Yes, we are trying to go from one sequence to another, but why do these sequences need to be processed left to right? After all, what if our example sentence was, “The boat wants to leave and so it needs fuel”? Does it really matter very much, for the final meaning of the translation, whether the leaving or the fuel is discussed first?
在考虑某个特定词语的上下文时,我们希望能够理解整个句子。事实上,我们常常希望超越单个句子。如果我们的示例包含两个句子——“船需要燃料。它想离开。”——我们就必须跨越句子界限才能看出“它”指的是“船”。
We would like to have access to the whole sentence when considering the context for a particular word. In fact, we often want to go beyond a single sentence. If our running example were two sentences—“The boat needs fuel. It wants to leave.”—we would have to cross sentence boundaries to see that “It” refers to “The boat.”
然而,回想一下,我们最初介绍神经网络时,就指出了使用词窗口的问题,因为词窗口会抓取一定数量词元内的所有上下文。那么,我们如何才能既保留序列到序列的处理过程,又能兼顾处理长序列和注意力层的明显优势,同时又能摆脱固定宽度窗口的限制以及 RNN 架构的强制单向处理呢?
Yet, recall that when we first introduced neural networks, we pointed out the issues with using word windows, which grab all of the context within a certain number of tokens away. How can we still keep a sequence-to-sequence process, with the clear benefits of working with both long sequences and an attention layer, while breaking free from both the restrictions of fixed-width windows and the forced unidirectional processing of an RNN architecture?
Transformer 解决了这些问题。下面是 Transformer 架构的大致轮廓(图 4-3):首先,堆叠多个编码器层,其中每一层都是神经网络学习关注自身的实现,通过将关注层反馈到其自己的层(而不是像图 4-2中那样反馈到解码器层)来实现。编码器堆叠之后是一堆解码器层,它们执行相同的操作。更准确地说,每对解码器层之间都有两种注意机制,一种用于关注自身,一种用于关注编码器输出。该模型可以访问完整的输入文本,并且输入和输出序列的长度仍然可以不同。
Transformers address these concerns. Here is a broad outline of the Transformer architecture (Figure 4-3): first, stack multiple encoder layers, where each layer is an implementation of the neural network learning to pay attention to itself, by feeding the attention layer back into its own layers (rather than into a decoder layer as in Figure 4-2). The encoder stack is then followed by a stack of decoder layers doing the same thing. To be a little more precise, two attention mechanisms are between every pair of decoder layers, one for paying attention to itself, and one for paying attention to the encoder output. The model has access to the full input text, and the input and output sequences can still be different lengths.
注意力机制现在可以变得更加巧妙。首先,自注意力机制使模型能够理解特定序列中哪些信息是相关的——无论其顺序如何。其次,我们可以采用多种机制来通知隐藏层,而不是使用一种机制(这称为多头注意力机制)。这种方法试图模拟这样一个事实:一个单词作为另一个单词的上下文出现的原因有很多,远远超出了简单的对齐。
The attention mechanism can now get even trickier. First, self-attention allows the model to understand what information is relevant in specific sequences—no matter their order. Second, instead of one mechanism informing a hidden layer, we can have several (this is called multihead attention). This approach is trying to mimic the fact that there are many reasons for a word to appear as context for another word that go far beyond simple alignment.
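For readers who want to see this in code, PyTorch ships a multihead attention module that can be pointed at a sequence's own representations. The sizes below are illustrative, and a real Transformer layer would also add positional information, feed-forward layers, residual connections, and layer normalization:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 4
self_attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, 4, embed_dim)              # "I kicked the ball" as four token vectors
attended, weights = self_attention(x, x, x)   # query, key, and value are all the sequence itself

print(attended.shape)  # torch.Size([1, 4, 64]) - each token now mixes in context from the others
print(weights.shape)   # torch.Size([1, 4, 4])  - head-averaged attention from every token to every token
```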
与我们之前讨论过的架构相比,Transformer 架构有三个关键方面:
The Transformer architecture has three key aspects, in contrast to the architectures we have discussed previously:
这本身并非一个新概念,你可能还记得报告前面关于堆叠 RNN 以提高性能的简短评论。但确切的堆叠方式有所不同。我们不再像饼干那样简单地将隐藏层堆叠在一起;现在我们做的是三明治。
Stacking layers is not a new concept by itself; you may recall the brief comment on stacking RNNs to improve performance earlier in the report. But precisely how the layers are stacked is different. We are no longer just stacking hidden layers on top of each other like a sleeve of crackers; now we're making sandwiches.
现在,每个编码器都包含一个自注意力层和一个隐藏层,隐藏层是一种注意力机制,用于将当前隐藏层聚焦于前一层的子集。每个解码器都有一个隐藏层和另外两个组件——一个用于自注意力,另一个用于关注编码器。参见图 4-3的简单可视化。
Every encoder now comprises a layer of self-attention (an attention mechanism for focusing the current hidden layer on subsets of the previous layer) as well as a hidden layer. Every decoder has a hidden layer and two other components: one for self-attention, and the second for paying attention to the encoder. See Figure 4-3 for a simple visualization.
现在有多种方法可以关注词语上下文。自注意力机制现在是多头的。
There is now more than one way to focus on word context. Self-attention is now multiheaded.
为了用一个具体的例子来解释第三点,多头自注意力,我们假设正在翻译“我踢了球”这句话。看一下图 4-4,它展示了我们如何看待注意力信息从一层传播到另一层。
To explain the third point, multihead self-attention, with a concrete example, let’s pretend to be translating the sentence “I kicked the ball.” Take a look at Figure 4-4, which shows how we might think of attention information traveling from one layer to another.
想象一下三种注意力机制并行工作:红色注意力机制关注动作,绿色注意力机制关注谁执行了动作,蓝色注意力机制关注动作的对象是谁(或什么)。那么,对于顶层的“kicked”(踢),红色注意力机制关注与底层相同单词的直接对齐;绿色注意力机制关注“I”(我),蓝色注意力机制关注“ball”(球),三种机制都不太关心冠词“the”(这)。5 在另一个句子“ She opened the window”(她打开了窗户)中,红色注意力机制关注“opened”(打开),绿色注意力机制关注“she”(她),蓝色注意力机制关注“window”(窗户)。
Imagine three attention mechanisms working in parallel: the red one focuses on actions, the green one focuses on who does the action, and the blue one focuses on who (or what) the action is done to. Then, for “kicked” in the top layer, red attention focuses on the direct alignment with the same word from the bottom layer; green attention focuses on “I,” blue attention focuses on “ball,” and none of them cares much about the article, “the.”5 In another sentence, “She opened the window,” red attention focuses on “opened,” green attention focuses on “she,” and blue attention focuses on “window.”
让我们立即回顾上一段。例如,我们实际上不能说绿色注意力机制正在寻找句子中“谁”正在做任何特定动作的语义。同样,语言模型不理解含义,也没有语义概念。这对我们来说是一个有用的简写,可以表示一个足够大的语言模型如何发现与已知语言结构紧密相关的关联。
Let’s immediately walk back the preceding paragraph. We cannot actually state that, for example, green attention is looking for the semantics of “who” is doing any particular action in a sentence. Again, LMs don’t understand meaning and have no concepts of semantics. This is helpful shorthand for us to represent the way in which a sufficiently large language model will discover associations that align closely with known linguistic structures.
我们已经以机器翻译和编码器-解码器架构为例解释了 Transformer 架构,事情已经变得相当复杂了。但我们只差一步:我们是如何从处理 seq2seq 任务的模型,发展到目前最先进的水平的:一个预训练的语言模型(你可能听说过 BERT,即 Transformer 的双向编码器表示),并且可以通过微调(使用更多训练数据进行微调)来处理各种其他任务的?
We’ve explained the Transformer architecture by using the example of machine translation and Encoder-Decoder architecture, and things have already gotten fairly complicated. But we have just one more step to go: how is it that we have also been able to shift from a model that handles seq2seq tasks to the current state of the art: a pretrained language model (you are likely to have heard of BERT, Bidirectional Encoder Representations from Transformers) that can be fine-tuned (tweaked a bit with more training data) to all sorts of other tasks?
Transformer 架构中还有哪些部分在执行“预测下一个单词”的功能?解码器。那么,如果我们舍弃一半的结构,创建一个只包含解码器的架构,会发生什么呢? OpenAI 推出的生成式预训练 Transformer (GPT) 模型在文本生成方面取得了越来越令人瞩目的成果。但现在我们又回到了单向模型!如何训练一个基于 Transformer 的模型,使其能够同时关注单词的左右上下文?
What part of the Transformer architecture is still playing “predict the next word”? The decoder. So what happens if we throw out half of the structure and create an architecture of only decoders? Generative Pretrained Transformer (GPT) models, introduced by OpenAI, are providing increasingly interesting results in text generation. But now we’ve gone back to only a unidirectional model! How can we train a Transformer-based model that looks at the full context of a word, both to the left and to the right?
拿好我的啤酒,BERT 说。7
Hold my beer, says BERT.7
在训练编码器时,BERT 不会尝试预测下一个单词,而是会屏蔽(或隐藏)输入文本中的一小部分,并尝试进行猜测。BERT 就像在玩填空游戏。现在,该模型可以在猜测时观察每个单词的左右两侧相当远的地方,并重复这个游戏数百万次。完成这个(从各个方面来说都是大规模的)训练过程并构建复杂的语言表征后,该模型就可以用于其他任务了。
Instead of trying to predict the next word while training its encoders, BERT masks, or hides, a small subset of the input text from itself and tries to guess it. BERT is playing fill in the blank. Now the model can look quite far to the left and right of each word as it tries to guess, and it repeats this game millions and millions of times. Having completed this (large-scale, in every sense) training process and built a complex representation of language, the model is now ready to be put to work on other tasks.
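If you would like to watch the game being played, the Hugging Face transformers library exposes pretrained BERT checkpoints behind a one-line pipeline. The library and the checkpoint name are our choices for illustration, not the report's:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT guesses the hidden word by looking at the context on both sides of the blank.
for guess in fill_mask("The boat needs [MASK] so it can leave."):
    print(f"{guess['token_str']:>10}  {guess['score']:.3f}")
```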
我们将简要概述微调如何在一个示例任务中发挥作用:句子分类。此微调步骤的训练数据与传统机器学习分类器的数据类似,由句子和标签对组成。输入序列由一个名为 <CLASS> 的特殊标记构成,后面跟着8和句子,输出是标签。BERT 的任务是猜测 <CLASS> 的值。
We will give a brief overview of how fine-tuning works on one example task: sentence classification. The training data for this fine-tuning step looks just like data for traditional machine learning classifiers, comprising pairs of sentences and labels. The input sequence is constructed from a special token called <CLASS>,8 followed by the sentence, and the output is the label. BERT is tasked with guessing the value of <CLASS>.
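In practice the special token is usually written [CLS], which plays the role of the report's <CLASS>. As a hedged sketch of the setup (again using the Hugging Face library, with an illustrative checkpoint and label count), the pieces look like this before any fine-tuning has happened:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This report makes language models approachable.", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())[:3])  # ['[CLS]', 'this', 'report']

with torch.no_grad():
    logits = model(**inputs).logits   # one score per label, read off the [CLS] position
print(logits.shape)                   # torch.Size([1, 2]) - meaningless until fine-tuned on labeled pairs
```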
令人惊奇的是,当在多个任务中使用完全相同的预训练 BERT 模型时,这种方法非常有效。此外,这种方法通常比之前专门为执行这些任务而构建和训练的(结构不同的)模型更成功,同时所需的标记训练数据(用于微调)也少得多。这种 Transformer 架构学习到的语言表征既实用又灵活。它基本上将英语知识从原来的填空训练任务迁移到了新的任务(句子分类或其他任务)。这被称为迁移学习。
The amazing thing is that this approach is quite effective when using the exact same pretrained BERT model across multiple tasks. Furthermore, this approach is frequently more successful than previous (differently structured) models that were constructed and trained to perform specifically on those tasks—while needing significantly less labeled training data (for fine-tuning). The language representation that this Transformer architecture has learned is both useful and flexible. It has transferred its knowledge of English generally from its original fill-in-the-blank training task to the new task (sentence classification or other). This is referred to as transfer learning.
迁移学习是在大量未标记数据上对一个大型语言模型(例如 BERT)进行预训练,并使用少得多的标记数据集对该单一模型进行微调以适应许多任务。
Transfer learning is pretraining one large language model, such as BERT, on a huge amount of unlabeled data, and fine-tuning this single model to many tasks with much smaller sets of labeled data.
终于,我们回到了第一个技术主题:词向量。预训练的 BERT 架构模型可以像 GloVe 或 ELMo 一样用于创建语境化的词向量,并且您可以将这些词向量输入到您的模型中!9这是另一种利用 BERT 的方法,而不是直接将其用作模型并根据您的任务进行微调。
Finally, we have come full circle to our first technical topic: embeddings. A pretrained BERT architecture model can be used just like GloVe or ELMo to create contextualized word embeddings, and you can feed these embeddings into your model!9 This is another way to make use of BERT, in contrast to using it as a model directly and fine-tuning it to your task.
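A minimal sketch of that embedding-only use, under the same assumptions about library and checkpoint as above:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The boat needs fuel.", return_tensors="pt")
with torch.no_grad():
    hidden_states = bert(**inputs).last_hidden_state   # one contextualized vector per token

# These vectors can be fed into your own downstream model, much as GloVe or ELMo vectors would be.
print(hidden_states.shape)   # torch.Size([1, 7, 768]) - includes the [CLS] and [SEP] positions
```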
我们已经达到了语言模型(LM)的当前水平,它们规模庞大、功能强大且资源匮乏。在本报告的下一章(也是最后一章)中,在回顾关键要点之后,我们将深入探讨这种现状的影响。
We have arrived at the current state of the art in LMs, and they are large, powerful, and resource hungry. In the next (and last) chapter of this report, following our review of key takeaways, we get into the implications of this state of affairs.
本章介绍了语言建模神经网络架构的最新进展。最后,我们再次强调关于注意力机制和 Transformer 架构的一些重要要点:
This chapter has introduced the most recent advancements in neural network architectures for language modeling. We conclude by reemphasizing some important takeaways around the attention mechanism and Transformer architecture:
Transformer 架构有三个关键方面:
多层编码器和解码器堆叠在一起,每个解码器都单独关注编码器堆栈的输出。
编码器堆栈中的每一对数据与解码器堆栈中的每一层数据之间都夹着一个自注意力层。自注意力层使模型能够理解特定序列中哪些信息是相关的,无论其顺序如何。
在这些自注意力层中,多头自注意力提供了并行的注意力机制,以不同的方式关注词语上下文。
There are three key aspects to Transformer architecture:
Multiple layers of encoders and decoders are stacked, with each decoder paying individual attention to the output from the encoder stack.
A self-attention layer is sandwiched between each pair in the encoder stack and each layer in the decoder stack. Self-attention allows the model to understand what information is relevant in specific sequences, no matter the order.
Inside these self-attention layers, multihead self-attention provides parallel attention mechanisms for focusing on word context in different ways.
1是的,这一切都让人想起《阴阳魔界》的一集。
1 Yes, this is all reminiscent of an episode of The Twilight Zone.
2 Dzmitry Bahdanau 等人, “通过联合学习对齐和翻译的神经机器翻译”, ICLR 2015。
2 Dzmitry Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate,” ICLR 2015.
3更多详细信息,请参阅Sarthak Jain 和 Byron C. Wallace 合著的《注意力不是解释》,NAACL-HT 2019。
3 For further details, see “Attention Is Not Explanation” by Sarthak Jain and Byron C. Wallace, NAACL-HLT 2019.
4有关更多详细信息,请参阅Danish Pruthi 等人于 2020 年撰写的《学习利用基于注意力的解释来欺骗》 。
4 For further details, see “Learning to Deceive with Attention-Based Explanations,” by Danish Pruthi et al., 2020.
5在真实的 Transformer 架构中,注意力机制的某些部分会关注文章以确保语法输出。
5 In a real Transformer architecture, some part of the attention mechanism will care about articles to ensure a grammatical output.
6 Ashish Vaswani 等人, “你只需要关注”, 2017 年。
6 Ashish Vaswani et al., “Attention Is All You Need,” 2017.
7特别感谢 Jay Alammar 和他的博客文章“图解 BERT、ELMo 等”。如果你不熟悉这个表达,可以去 Know Your Meme 上看看“Hold My Beer”的相关内容。
7 A huge hat tip to Jay Alammar and his blog post, “The Illustrated BERT, ELMo, and Co.”, in particular. If this expression is new to you, you can read about “Hold My Beer” on Know Your Meme.
8回想一下上一章中提到的<START>和<STOP>标记。
8 Recall the <START> and <STOP> tokens mentioned in the previous chapter.
9当然,嵌入有多种变体;研究人员一直很忙!
9 Of course, there are variations of embeddings; researchers have been busy!
现在,您已经更好地理解了各种语言模型架构及其相关任务,您可能会想:“是时候收集数据并开始训练了!” 不幸的是,事实并非如此。正如我们所指出的,语言模型学习的是语言的语法、结构及其关联,而不是其背后的含义。那么机器如何阅读和理解呢?让我们通过观察人类行为来探索这个概念。
Now that you better understand a variety of language model architectures and associated tasks, you are probably thinking, “Time to collect data and get to training!” Unfortunately, this is not the case. As we have noted, LMs learn about syntax and structure of language and their associations, but not the underlying meaning. So how would machines read and comprehend? Let’s use an observation of human behavior to explore this concept.
想象一下,你正在观察一个刚刚开始学习说话的小孩。要学会说话,孩子必须练习。这需要他们听单词、说单词、读单词,并参与对话。当然,他们会得到你的指导,从而确定学习的方向。
Imagine that you are watching a young child who is just now learning to speak. To learn how to speak, the child must practice. This requires them to hear words, say words, read words, and engage in conversation. They, of course, will receive guidance from you to inform the direction of their learning.
现在想象一下,你不在他们身边,无法指引他们。相反,他们被带离你,在商场里长大一年。在那里,他们在商场的环境中学习语言,学习词汇来描述他们的感官输入,比如他们所看到的和闻到的。他们回到你身边,现在已经能够流利地说话了。不幸的是,他们似乎对自己所说的话的含义知之甚少。他们只是在商场的环境中学会了如何将单词串联起来,现在他们回家了。
Now imagine that you are not there to help direct them. Instead, they are taken from you and are raised in a mall for one year. There they learn language within the context of the mall, where they pick up the vocabulary to describe their sensory inputs, like what they see and smell. They come back to you, now capable of speaking quite well. Unfortunately, they seem to have little insight into what their words mean. They had picked up only how to string words together in the context of a mall, and now they are home.
你问孩子过得怎么样,他回答说“有什么好吃的吗?”你感到很疑惑,说你过得很好;但是你饿了,所以你想听听他们想吃什么,他们回答说“5号”。你进一步询问,“你想吃鸡翅吗?”孩子回答说“不,5号”。这时,你意识到孩子学会了在商场点餐,但在家里被提示点餐时却不明白。看起来孩子已经学会了把单词串在一起的能力,但在这种新环境下,他们的交流不太有意义。更糟糕的是,他们说话就像一个在商场闲逛的15岁孩子。
You ask the child how they are doing, to which you receive the response, “What’s good?” Puzzled, you say you are doing well; however, you are hungry, so you’d like their input on what they want to eat, to which they respond, “A number 5.” You inquire further, “So you want chicken wings?,” to which the child responds, “No, a number 5.” At this point, you realize that the child learned to order food at a mall but does not understand when prompted at home. It appears that the child has picked up the ability to string words together, but their communication does not make a lot of sense in this new context. Worse yet, they talk like a 15-year-old…who hangs out at the mall.
现在你意识到,你不知道孩子除了理解了什么之外还能理解什么,当你跟他们说话时,他们应该用理论上合理的词语来回应。但现在语境变了,他们有时会说些有道理的话,有时又会胡言乱语。你问:“你为什么在家说话总是不标准?”他们看着你问:“有什么区别?”
Now you realize that you don’t know what the child comprehends, other than that, when you speak to them, they should respond with words that technically make sense. But now that the context has shifted, they sometimes say sensible things and other times say nonsense. You ask, “Why can’t you always talk properly at home?,” and they look at you and say, “What’s the difference?”
你需要大量的数据,但你也需要正确的数据。
You need a lot of data. But you also need the right data.
上述例子中的孩子在没有任何背景知识的情况下走向社会,并学会了说英语。他们能够观察并从环境中提取足够的信息,从而适应他们所熟知的世界。不幸的是,他们在商场里学到的东西,在家里说出来并不总是能很好地翻译过来。对你来说,要改变孩子的语言,你还有很长的路要走。你不可能追溯商场生活对他们理解英语结构和含义的所有影响,而重新训练他们需要大量的时间和精力。在未来很长一段时间里,这个孩子很可能会继续说出一些奇怪的、不合逻辑的话。
The child from the preceding example went out into the world with no previous context and learned to speak English. They were able to observe and extract enough information from their environment to make their way through the world as they knew it. Unfortunately, the things they learned to say at the mall do not always translate well when speaking at home. For your part, you have a long struggle ahead of you to change the child’s language. It is impossible to trace all of the influences that the time in the mall had on their understanding of the structure and meaning of English, and retraining them will take significant time and effort. The child is likely to continue issuing strange non sequiturs for a long time to come.
语言模型很乐意用商场一年的口语和书面语进行训练,只要它仍然“在商场”,它就能有效地发挥作用。1但是,当被要求在不同的语境中工作时,语言模型的表现可能会受到影响,而且就像孩子一样,它无法解释自己。如今流行的大型语言模型非常耗时:它们需要大量的训练数据才能执行,而且它们并不关心这些数据是什么。那么,我们如何才能脚踏实地地看待语言模型能够完成什么呢?在本章中,我们最终将深入探讨理解,或者在我们的例子中是自然语言理解。
An LM would be happy to train on a year’s worth of spoken and written language of the mall and would perform effectively so long as it remained “at the mall.”1 But when asked to work in a different context, the language model’s performance may suffer, and—just like the child—it will not be able to explain itself. The large LMs that have become popular today are extremely hungry: they need a lot of training data to perform, and they do not care what that data is. So how do we keep a grounded perspective on what LMs are able to accomplish? In this chapter, we finally dive into understanding, or in our case natural language understanding.
对于人类来说,理解语言或文本的过程通常分为三个部分:处理文本、理解其含义以及与现有知识整合。语言模型(LM),即使是非常复杂的模型,也擅长第一部分,有时还会模仿在第二部分和第三部分观察到的一些行为,尽管它们无法有意识地推理文本的含义。
For humans, the process of understanding language, or text, generally has three parts: process the text, understand its meaning, and integrate with existing knowledge. LMs, even the complex ones, are good at the first part and sometimes mimic some of the behavior observed during the second and third parts, though without being able to intentionally reason about meaning.
尽管 LM 不执行任何推理,但它们经常表现得好像已经执行过推理一样。
Although LMs do not perform any reasoning, they frequently can appear as if they have done so.
让我们举一个具体的例子,回到预测下一个单词的生成任务。我们在网上找到了一个 GPT 模型的演示。我们尝试了以下简短的实验:
Let’s take a concrete example, returning to the generative task of predicting the next word. We found an available demo online of a GPT model.2 We tried the following short experiment:
让我们从最酷的部分开始。GPT 语言模型通过列出城市完成了一个不完整的句子;列出的城市罗马确实是意大利的一个城市;罗马与意大利的关系就像巴黎与法国的关系一样(它们都是首都)。所有这些都无需任何推理或外部知识!接下来,输出中的其余句子都与城市相关,包括食物、天气和文化。该模型没有输出诸如“我喜欢这两个城市的卧室”之类的内容。该模型只列出了与城市相关的方面。当然,其余句子都是完整的、语法正确的,并且与提示一样保持了第一人称。
Let’s start with the cool part. The GPT language model completed an incomplete sentence by listing a city; the city listed, Rome, is indeed a city in Italy; and Rome has the same relationship to Italy as Paris does to France (they are both capitals). All this with no reasoning or external knowledge! Next, the rest of the sentences in the output are all sentences that could reasonably be things to say about cities, which have food, weather, and culture. The model did not output something such as, “I like the bedroom in both of them.” The model listed only aspects that make sense for cities. And, of course, the rest of the sentences are complete, grammatically correct, and remain in the first person, as the prompt did.
另一方面,输出结果并不十分有趣,而且很快就开始重复。在“最喜欢的城市”这个话题上,模型除了单调地列出两个城市或两个城市中每个城市都喜欢的方面之外,没有任何有趣的进展。我们可以假设各种各样的情况:这个提示太短了,或者,由于提示中的句子很短,模型也倾向于使用短句作为输出。但我们无法弄清楚数百万甚至数十亿个参数所形成的关联究竟发生了什么。
On the other hand, the output is not very stimulating and quickly starts to repeat itself. The model does not go anywhere interesting with this topic of favorite cities beyond monotonously listing aspects that are liked in each of the two of them or in both of them. We might hypothesize all sorts of things: that this prompt was too short, or, because the sentences in the prompt were short, the model took a cue to also prefer short sentences for the output. But we can’t figure out what happened with the associations formed by the millions—or billions—of parameters.
我们或许会想,既然模型正确地将罗马识别为与巴黎平行的城市,它就能够正确处理类比。好的,我们试试看。
We may want to think that, since the model correctly identified Rome as the parallel to Paris, it would be able to handle analogies correctly. OK, let’s try.
罗马和意大利的关系在开头再次被正确地呈现。接下来的关系可能还好,也可能不好:纽约不是美国的首都,但绝对是美国最大、最具文化意义的城市之一(就像巴黎和罗马一样)。但进一步说,文本变得非常迷幻,先是说国家有首都是老生常谈(啊?),然后又猜测如果一个国家是另一个国家的首都会发生什么(这不可能),最后还出现了一些带有历史意味的胡言乱语。而且,即使句子的长度和类型各不相同,语法也完全正确。
Once again, the Rome-Italy relationship comes up correctly in the beginning. The next relationship may or may not be OK: New York is not the capital of the United States but is definitely one of the largest and most culturally significant cities (like Paris and Rome). But going further, the text gets really trippy, first calling countries having capitals a truism (huh?), wondering what would happen if one country were the capital of another (not possible), and winding up with some absolute nonsense with a historical flavor. And it’s all wonderfully grammatically correct, even with sentences of different lengths and types.
还有一点需要注意:如果您尝试使用与我们相同的提示运行相同的模型,您可能会得到不同的生成输出。3有些输出看起来质量更好,有些则更差,我们强烈建议您尝试任何可用的演示,直到您获得足够的轶事数据点来满足您自己的直觉。但对于模型来说,就像对于商场里的孩子一样,这一切都只是“有什么区别?”的问题。就其本身而言,一切都同样有意义。
One more note: if you were to try to run the same model with the same prompts as we did, you would likely get different generated output.3 Some outputs will appear to be of better quality to you, and some of worse, and we strongly encourage you to play around with any available demos until you have enough anecdotal data points to satisfy your own intuition. But to the model, as to the child in the mall, it will all be a case of “what’s the difference?” As far as it’s concerned, it all makes the same amount of sense.
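If you want to run a version of this experiment yourself, here is one hedged way to do it with the Hugging Face library. We use the small GPT-2 checkpoint as a stand-in (the GPT-J-6B model behind the demo needs far more memory), and the prompt below is only in the spirit of our experiment, not its exact text. Sampling is random by default, so repeated runs will differ unless you fix the seed:

```python
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(42)   # "freeze" the randomness so that the same prompt gives the same output

prompt = "I like Paris. Paris is in France. I also like Rome."
for candidate in generator(prompt, max_new_tokens=40, num_return_sequences=2, do_sample=True):
    print(candidate["generated_text"], "\n---")
```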
到现在为止,希望你已经对语言模型擅长生成易读文本这一事实感到满意了——但它可能并非完全符合事实,句子的组合也可能显得毫无意义。语言模型本身不进行任何推理,也不具备任何常识或任何它们在训练数据中未曾遇到过的知识。因此,一个强大的端到端系统可能会尝试为语言模型添加其他组件,这些组件可以执行某种推理或整合额外的知识(或两者兼而有之)。
By now you’re hopefully quite comfortable with the fact that LMs are good at producing text that reads well—but it absolutely may not be factually correct, and the combination of sentences may appear to be nonsense. LMs do not perform any sort of reasoning and do not by themselves have any idea of common sense or anything that they have not encountered in their training data. Therefore, a robust end-to-end system may seek to augment LMs with other components that can either perform some sort of reasoning or incorporate additional knowledge (or both).
让我们以一个用于制定旅行计划的对话系统(聊天机器人)为例。在对话的每次迭代中,系统采取的步骤大致如下:处理来自人类用户的输入,识别文本的突出部分,收集信息以形成响应(包括对话早期的信息),并以结构良好的自然语言形式返回答案。语言模型 (LM) 擅长处理第一步和最后一步,并且可以为中间步骤提供有用的信息。但是,当完全依靠它们自己时,它们必须尽可能地处理中间步骤,完全依赖于它们在训练期间形成的联想,这可能相当神秘且不一致,尤其是对于像对话这样复杂的任务而言。
Let’s consider a conversation system (a chatbot) for making travel plans as an example. For every iteration in the conversation, the steps the system takes are roughly as follows: process the input from the human user, identify the salient parts of the text, gather the information to form the response (including from earlier in the conversation), and return an answer in the form of well-structured natural language. LMs are great at the first and last steps and can give helpful information for the middle steps. But when left entirely to their own devices, they must fudge the middle steps as best they can, going entirely off associations they have formed during their training, which may be fairly mysterious and inconsistent, especially for a complex task like conversation.
可以添加到系统中的是与语言模型 (LM) 协同执行推理或咨询外部知识的组件;请注意,推理和知识的类型是否必要取决于您的用例。有许多有趣的变体,因此我们仅简要介绍两种,并鼓励您进一步探索。
What can be added to the system are components that work with the LM to perform reasoning or consult with outside knowledge; note that the type of reasoning and knowledge that is, or is not, deemed to be necessary depends on your use case. Many interesting variations are possible, so we will just briefly mention two and encourage you to look into more.
一种是实体链接的方法。4如果我们拥有一个包含许多已知实体和关系的大型知识库(有时确实如此,具体取决于任务领域),并且结构良好且完全易于理解,那么识别文本中这些实体和关系的存在将大有帮助。以用户话语“我想去意大利。有飞往首都的航班吗?”为例,实体链接过程将识别实体“意大利”和关系“首都”。然后,推理跳跃将在知识库中查询三元组(?,“首都”,“意大利”)中缺失的组成部分并得到结果(“罗马”),从而确定所需的目的地。
One is the approach of entity linking.4 If we have a large knowledge base with many known entities and relationships (and we sometimes do, depending on the task domain), which is well structured and fully understandable, identifying the presence of those entities and relationships in the text can go a long way. Taking the sample user utterance, “I’d like to go to Italy. Are there any flights to the capital?,” the process of entity linking would identify the entity “Italy” and the relationship “capitalOf.” A reasoning leap would then query the knowledge base for the missing component of the triple (?, “capitalOf,” “Italy”) and get the result (“Rome”), thus determining the desired destination.
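A toy sketch of that last lookup step, with a hypothetical two-fact knowledge base standing in for the large structured resources real systems use:

```python
knowledge_base = {
    ("Rome", "capitalOf", "Italy"),
    ("Paris", "capitalOf", "France"),
}

def fill_subject(relation, obj):
    """Answer the triple query (?, relation, obj)."""
    return [s for (s, r, o) in knowledge_base if r == relation and o == obj]

# The entity linker found "Italy" and "capitalOf" in the user's utterance,
# so the system asks the knowledge base for the missing piece.
print(fill_subject("capitalOf", "Italy"))   # ['Rome']
```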
另一种有趣的方法是创建一个新的实体嵌入空间,供语言模型参考。这里的动机是命名实体解析。考虑三个指代同一城市的短语:“NYC”、“纽约市”和“纽约”(最后一个当然在没有上下文的情况下是模棱两可的)。传统的语言模型必然会学习到每个短语略有不同的表示。预训练的实体嵌入模型可以将它们表示为完全相同,以近似地表示它们指的是同一实体。5在与旅行聊天机器人的对话中,如果人类用户首先提到“从纽约出发旅行”,然后询问“从纽约出发的航班”,这可能有助于系统识别它们指的是同一实体。
Another interesting approach is to create a new embedding space of entities that the language model can consult. The motivation here is named entity resolution. Consider three phrases that all refer to the same city: “NYC,” “New York City,” and “New York” (the last one is, of course, ambiguous without context). A traditional language model will necessarily have learned somewhat different representations of each. A pretrained entity-embedding model could represent these as exactly identical, to approximate the concept that they refer to the same entity.5 In the conversation with the travel chatbot, if the human user first mentions “traveling out of NYC” and later asks about “flights from New York,” this may help the system recognize them as referring to the same entity.
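As a toy illustration of the idea (the vector values and the alias table are entirely made up), every known way of writing the entity maps to one shared embedding:

```python
import numpy as np

entity_vectors = {"New York City": np.array([0.2, 0.9, 0.4])}
aliases = {"NYC": "New York City", "New York": "New York City", "New York City": "New York City"}

def entity_embedding(mention):
    # Resolve the surface form to its canonical entity, then look up the shared vector.
    return entity_vectors[aliases[mention]]

# "NYC" in one turn and "New York" in a later turn land on exactly the same vector.
print(np.array_equal(entity_embedding("NYC"), entity_embedding("New York")))  # True
```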
重申一下:对于人类而言,理解文本(书面或口头)需要处理文本、理解其含义并与知识进行整合。语言模型完成了第一步,而一个擅长此道的模型(例如 BERT)是一笔真正的财富,因为它只需构建一次,然后用少得多的数据进行微调,就能成功部署到各种任务中,从问答到翻译再到文本生成。从 BERT 的角度来看,它执行的是文本处理的相同步骤。但是语言模型无法执行推理和知识整合的第二步或第三步。尽管它们有时可以做出惊人的模仿,但它们无法区分正确和错误的信息。
To reiterate: for humans, text (written or spoken) comprehension requires processing text, understanding its meaning, and integrating with knowledge. Language models do the first step, and a model that is great at it (such as BERT) is a real asset, as it can be constructed once and fine-tuned with a whole lot less data for successful deployment on all sorts of tasks, from question answering to translation to text generation. From the perspective of BERT, it’s doing the same step of text processing. But LMs cannot perform the second or third steps of reasoning and knowledge integration. Even though they can sometimes give a startling imitation of doing so, they cannot differentiate between correct and incorrect information.
另一方面,对于部署系统的人来说,了解这一点对于所需任务是否重要也很重要。即使没有任何推理或知识,语言模型仍然可以非常有效地完成某些任务。例如,一个经过足够多问答对训练的语言模型,可以在问答任务上表现出色,足以满足实际应用的需求。
On the other hand, it is important for the person deploying the system to understand whether that matters for the required task. LMs can still be quite effective at particular tasks without any reasoning or knowledge. For example, an LM trained on a sufficiently large number of question-and-answer pairs could do an excellent job at the question-answering task and be quite sufficient for the practical application.
语言模型 (LM) 目前的发展重点在于规模化和适应性。神经网络的基本设计并非近来才出现,矩阵运算和导数的底层数学知识也远早于我们当前的技术时代。语言模型之所以变得庞大而强大,主要有三个原因:本报告中讨论的深度神经网络架构的发展,尤其是基于 Transformer 的模型;计算能力的并行化提升;以及用于训练这些模型的数据越来越多。
The current story of LMs is one of scale and adaptability. The basic design of neural networks is not recent, and the underlying mathematics of matrix operations and derivatives significantly predate our current technological era. LMs have become large and powerful for three main reasons: the development of the deep neural network architectures discussed in this report, especially Transformer-based models; the parallel improvements in computation power; and, finally, the availability of more and more data for training these models.
因此,我们到达了一个有趣的转折点。我们不再仅仅谈论人们使用相同类型的模型进行机器学习任务(SVM、逻辑回归,甚至 RNN)。现在,我们有越来越多的例子表明,人们使用同一个主要语言模型(例如一个特定的预训练 BERT)并针对各种任务对其进行微调,这既是因为这种方法在各个任务中都取得了成功,也是因为构建和训练大型语言模型需要大量的计算和数据资源。
And so we have arrived at an interesting transition point. We are no longer talking only about people using the same types of models for machine learning tasks (SVMs, logistic regressions, even RNNs). We now have more and more examples of people using the same main language model (such as a specific pretrained BERT) and fine-tuning it for all sorts of tasks, both because this approach has been successful across tasks and because of the really high amount of computing and data resources required to construct and train large LMs.
所有这些经过微调的模型都将继承少数大型 LM 的优点和问题。6我们将仅讨论其中的一些:
All of these fine-tuned models are then going to inherit both the benefits and the problems of a small number of large LMs.6 We will touch on just some of these:
以单一模型为基础可能会成为单点故障,无论是由于数据不良、稳健性失败还是实际攻击造成的。
Having a single model as the basis runs the risk of it becoming a single point of failure, whether as a result of bad data, robustness failure, or actual attacks.
使用 BERT 等模型进行的迁移学习,可以有效地让你在网络上进行训练,然后跨领域、跨行业、跨客户地进行微调和部署。这一过程的法律和隐私问题尚不明确,而且相关法规显然还没有跟上技术的步伐。
The transfer learning that is done with models like BERT effectively allows you to train on the web and then fine-tune and deploy models across domains, industries, and clients. The legal and privacy aspects of this process are far from clear, and the regulations certainly have not yet caught up with the technology.
大型语言模型的需求重数量轻质量,重不透明轻透明度。正如报告开篇所述,模型擅长生成可信内容,但却无力维护其输出的真实性。仅仅依靠这些模型做出决策可能很危险,因为它们的输出可能难以解释,而且可能会以意想不到的方式失效。此外,由于输出结果可能过于可信,也存在被滥用的风险,例如制造虚假信息,尤其是在针对性内容方面。
The needs of large LMs emphasize quantity over quality and opaqueness over transparency. As discussed in the beginning of the report, models are great at producing plausible content but have no ability to care about the veracity of their output. It may be dangerous to make decisions solely based on these models, whose outputs can be hard to explain, and which can fail in unexpected ways. Furthermore, because the output can be so plausible, opportunities exist for misuse, such as creating disinformation, especially with targeted content.
训练数据的内在偏差会逐渐显现。特定的社群和观点不可避免地会被过度或不足地表达,或者与特定的观点相关联(例如,研究人员发现,BERT 会将提及残疾人的短语与更多负面情绪词关联起来)。7
Intrinsic biases from the training data will make themselves known. Particular communities and perspectives will inevitably be either over- or underrepresented, or associated with particular points of view (as one real example, researchers have found that BERT associates phrases referencing persons with disabilities with more negative sentiment words).7
创建像 BERT 这样的大型语言模型所需的规模非常巨大。一方面,训练和运行模型所需的投入可能会对财务和环境造成巨大影响。另一方面,很少有人能够获得足够的资源来做到这一点,而这些开发人员在设计和部署方面的选择可能会产生深远的影响。简而言之,我们应该不断思考“什么时候应该构建模型,什么时候不应该构建模型?”
The scale needed to create a large language model such as BERT is enormous. On the one hand, the financial and environmental impacts of the efforts necessary to train and run the model can be significant. On the other hand, few people will even have access to the resources to do so, and those developers' choices in design and deployment may have widespread implications. To put it bluntly, we should continuously be asking "when should a model be built, and when should it not be?"
我们希望您已经对语言模型的通用功能有所了解,现在能够放下逻辑细节,与业务中的各利益相关者就高级语言模型概念进行深入的沟通,并且对如何将语言模型应用到您的业务中有足够的信心。最后,我们想分享一些关于人类、机器和语言的想法:
We hope that you have gained an appreciation for the general functionality of LMs, that you will now be able to take a step back from the logistical details and have a thoughtful conversation with the various stakeholders in your business about high-level language model concepts, and that you have reasonable confidence in how to take the next steps toward applying LMs to your business. We would like to leave you with these last few thoughts on humans, machines, and language:
1这可不是天方夜谭!大型语言模型都是用网络上的文本进行训练的,我们知道这些文本的质量参差不齐(就像商场里的一样)。
1 This isn’t a stretch of the imagination! Large language models are trained on text available on the web, and we know what a mixed bag of quality that can be (like the mall).
2该演示在EleutherAI上提供。该模型名为 GPT-J-6B,根据其GitHub repo,它是一个具有 60 亿个参数的版本。
2 The demo is provided at EleutherAI. The model is called GPT-J-6B, and, according to its GitHub repo, is a version with six billion parameters.
3这里可以进行更技术性的讨论。基本上,有时可以调整某个参数来“冻结”模型,以便相同的输入产生相同的输出,但大型模型在选择最终生成的文本的过程中通常会带有随机性。我们测试的特定模型演示确实如此。
3 A much more technical discussion can be had here. Basically, sometimes a parameter can be tweaked to “freeze” the model so that identical input will yield identical output, but frequently large models have an element of randomness in the process of choosing the final generated text. The specific model demo we played around with certainly does.
4我们指的是一个命名实体,通常是人、地点或事物。
4 We are referring to a named entity, which is generally a person, place, or thing.
6要想对这些观点进行有趣且及时(尽管也很冗长且学术)的讨论,请仔细阅读斯坦福大学基础模型(基本上是大型语言模型)研究中心的首份报告“论基础模型的机遇与风险”以及Emily M. Bender 等人撰写的“论随机鹦鹉的危险:语言模型会太大吗?” 。
6 For an interesting and timely, although also lengthy and academic, discussion of these points, peruse the inaugural report from the Stanford University Center for Research on Foundation Models (which are basically large language models), “On the Opportunities and Risks of Foundation Models” as well as “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” by Emily M. Bender et al.
7请参阅“论随机鹦鹉的危险:语言模型会太大吗?”第 4.3 节。
7 See section 4.3 of “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”.